Joldosh's Electronics Projects: Optimized MSP430 Assembly in Mecrisp Forth

A few years ago, I got interested in Forth due to its connection to RPN calculators but lost interest because of the huge hit to performance a stack-based system takes on an architecture like 6502. Recently, my interest was piqued again during a conversation in the #Forth IRC channel on FreeNode about Mecrisp-Across, which uses an ARM-based microcontroller board to produce optimized assembly for the MSP430. Because the compiler tries to keep values in registers instead of the stack, it has the potential to reclaim some of the performance lost in most Forths. Some of these optimizations might be useful for a 6502-based Forth, so I tried to understand how optimizations in Mecrisp work.

The first step was getting Mecrisp running on a PC. The author provides versions that are compiled for ARM with Linux system calls, so you can use QEMU for emulation under Linux. Getting this working on Windows was a real challenge. One Debian image I tried emulating with QEMU crashed the emulator outright. Another image booted into Linux but refused to install QEMU complaining about needing the Linux CD in the drive before it would install any packages. Getting Ubuntu working on VirtualBox under Windows 10 and installing QEMU on it wasn't too tough, but the Mecrisp image wouldn't run until I got the right type of QEMU (there are several). The right combination was VirtualBox, an Ubuntu image from OSboxes, and QEMU-user-static.

Mecrisp-Across has a host mode for compiling and other high level tasks and a target mode for defining Forth words for the MSP430. The documentation shows that you can only compile once after typing in your words and then you have to erase everything before compiling again. Maybe this is reasonable considering how little memory there is to work with. You can save some time by saving all your commands to a file then using cat and a pipe to direct the file contents to QEMU:

cat test.txt | qemu-arm-static mecrisp-stellaris-linux-with-mecrisp-across

The file itself is pretty simple to set up, and you can generate a binary with just a few lines.

new
target

: main 2 2 + ; \ main program

host
$FFFE vector main
crosscompile
disimage

The last line shows a disassembly of the compiled program, which is the really interesting part. The first question is whether the compiler folds constants and if so, if it does it between words. Here is the first program I tried:

: t1 3 + ;
: t0 5 * t1 ;
: main 7 t0 ;

This should calculate 7 * 5 + 3 and leave 38 on the stack. Here is the generated assembly:

F800: 40B2 mov.w #5A80h, &120h

F802: 5A80

F804: 0120

F806: 4031 mov.w #280h, r1

F808: 0280

F80A: 4034 mov.w #260h, r4

F80C: 0260

F80E: 8324 sub.w #2h, r4

F810: 4684 mov.w r6, 0h(r4)

F812: 0000

F814: 4036 mov.w #26h, r6

F816: 0026

F818: 3FFF jmp F818

The green part disables the watchdog timer and sets up the two stacks. F80E - F812 saves r6 and the next line loads r6 with the calculated value 38, so it is folding contants even between words! One thing we can see is that mov.w #260h, r4 and sub.w #2h, r4 could be combined into mov.w #25Eh, r4, although it might not be optimized if the setup code is just boilerplate. Also, the line saving r6 to the stack could be eliminated since r6 is uninitialized at that point.

Does it also fold constants when an if statement is involved?

: t2 dup 38 = if 5 then ;

: t1 3 + t2 ;

: t0 5 * t1 ;

: main 7 t0 ;

(setup removed)

F80E: 8224 sub.w #4h, r4

F810: 40B4 mov.w #26h, 0h(r4)

F812: 0026

F814: 0000

F816: 4684 mov.w r6, 2h(r4)

F818: 0002

F81A: 4036 mov.w #5h, r6

F81C: 0005

F81E: 3FFF jmp F81E

This saves the calculated 38 to the stack and keeps 5 in TOS as expected.

Does constant folding also work in loops? This code sums all the numbers from 1 to n, where here n is 7:

: sum 0 swap 1+ 1 do i + loop ;

: main 7 sum ;

(setup removed)

F80E: 4307 mov.w #0h, r7

F810: 4318 mov.w #1h, r8

F812: 4239 mov.w #8h, r9

F814: 480A mov.w r8, r10

F816: 570A add.w r7, r10

F818: 5318 add.w #1h, r8

F81A: 9908 cmp.w r9, r8

F81C: 4A07 mov.w r10, r7

F81E: 23FA jnz F814

F820: 8324 sub.w #2h, r4

F822: 4684 mov.w r6, 0h(r4)

F824: 0000

F826: 4706 mov.w r7, r6

F828: 3FFF jmp F828

Other than unnecessarily caching r6 on lines F820 - F824, this is close to the version you would write in assembly, which is impressive. It doesn't use the stack at all. Unfortunately, it does not collapse constants in this case, even though there is enough information to calculate the answer at compile time.

The last thing to check is how it handles register allocations in situations where the size of the stack cannot be determined at compile time.

: main 3 4 P1IN @ 5 = IF 6 THEN * ;

Here, stack starts with [3 4] and depending on the input of P1IN may become [3 4 6]. Therefore, the * instruction can't predict whether it will multiply 3 and 4 or 4 and 6. Here is the assembly:

(setup removed)

F80E: 90B2 cmp.w #5h, &20h

F810: 0005

F812: 0020

F814: 4227 mov.w #4h, r7

F816: 4038 mov.w #3h, r8

F818: 0003

F81A: 2007 jnz F82A

F81C: 8324 sub.w #2h, r4

F81E: 4684 mov.w r6, 0h(r4)

F820: 0000

F822: 4806 mov.w r8, r6

F824: 4708 mov.w r7, r8

F826: 4037 mov.w #6h, r7

F828: 0006

F82A: 4309 mov.w #0h, r9

F82C: 1008 rrc.w r8

F82E: 2801 jnc F832

F830: 5709 add.w r7, r9

F832: 5707 add.w r7, r7

F834: 23FB jnz F82C

F836: 8324 sub.w #2h, r4

F838: 4684 mov.w r6, 0h(r4)

F83A: 0000

F83C: 4906 mov.w r9, r6

F83E: 3FFF jmp F83E

Lots of neat things are happening here. P1IN @ 5 = reduces down to just one instruction, which is really cool! The multiplication is done with the Russian peasant algorithm. The main thing I was wondering about is what happens to register assignments when the top of the stack is unpredictable. In this case, register assignments are shifted by one, accomplishing the same function as a stack push. This ends up being slower than the code you would write if you weren't relying on the stack but a lot faster than relying on the stack directly like other Forths do.

All in all I think the optimizations that Mecrisp does are very impressive, and it was interesting to see how some of them work. It gave me a lot of ideas for how I would try to do optimizing if I made a language like Forth for one of my calculators.

Joldosh's Electronics Projects

Sunday, December 8, 2019

Optimized MSP430 Assembly in Mecrisp Forth

No comments:

Post a Comment