Sunday, December 8, 2019

Optimized MSP430 Assembly in Mecrisp Forth


A few years ago, I got interested in Forth due to its connection to RPN calculators but lost interest because of the huge hit to performance a stack-based system takes on an architecture like 6502. Recently, my interest was piqued again during a conversation in the #Forth IRC channel on FreeNode about Mecrisp-Across, which uses an ARM-based microcontroller board to produce optimized assembly for the MSP430. Because the compiler tries to keep values in registers instead of the stack, it has the potential to reclaim some of the performance lost in most Forths. Some of these optimizations might be useful for a 6502-based Forth, so I tried to understand how optimizations in Mecrisp work.

The first step was getting Mecrisp running on a PC. The author provides versions that are compiled for ARM with Linux system calls, so you can use QEMU for emulation under Linux. Getting this working on Windows was a real challenge. One Debian image I tried emulating with QEMU crashed the emulator outright. Another image booted into Linux but refused to install QEMU complaining about needing the Linux CD in the drive before it would install any packages. Getting Ubuntu working on VirtualBox under Windows 10 and installing QEMU on it wasn't too tough, but the Mecrisp image wouldn't run until I got the right type of QEMU (there are several). The right combination was VirtualBox, an Ubuntu image from OSboxes, and QEMU-user-static.

Mecrisp-Across has a host mode for compiling and other high level tasks and a target mode for defining Forth words for the MSP430. The documentation shows that you can only compile once after typing in your words and then you have to erase everything before compiling again. Maybe this is reasonable considering how little memory there is to work with. You can save some time by saving all your commands to a file then using cat and a pipe to direct the file contents to QEMU:

cat test.txt | qemu-arm-static mecrisp-stellaris-linux-with-mecrisp-across

The file itself is pretty simple to set up, and you can generate a binary with just a few lines. 

new
target

: main 2 2 + ; \ main program

host
$FFFE vector main
crosscompile
disimage

The last line shows a disassembly of the compiled program, which is the really interesting part. The first question is whether the compiler folds constants and if so, if it does it between words. Here is the first program I tried:

: t1 3 + ;
: t0 5 * t1 ;
: main 7 t0 ;

This should calculate 7 * 5 + 3 and leave 38 on the stack. Here is the generated assembly:

F800: 40B2  mov.w #5A80h, &120h
F802: 5A80  
F804: 0120  
F806: 4031  mov.w #280h, r1
F808: 0280  
F80A: 4034  mov.w #260h, r4
F80C: 0260  
F80E: 8324  sub.w #2h, r4
F810: 4684  mov.w r6, 0h(r4)
F812: 0000  
F814: 4036  mov.w #26h, r6
F816: 0026  
F818: 3FFF  jmp F818

The green part disables the watchdog timer and sets up the two stacks. F80E - F812 saves r6 and the next line loads r6 with the calculated value 38, so it is folding contants even between words! One thing we can see is that mov.w #260h, r4 and sub.w #2h, r4 could be combined into mov.w #25Eh, r4, although it might not be optimized if the setup code is just boilerplate. Also, the line saving r6 to the stack could be eliminated since r6 is uninitialized at that point. 

Does it also fold constants when an if statement is involved?


: t2 dup 38 = if 5 then ;
: t1 3 + t2 ;
: t0 5 * t1 ;
: main 7 t0 ;

(setup removed)
F80E: 8224  sub.w #4h, r4
F810: 40B4  mov.w #26h, 0h(r4)
F812: 0026  
F814: 0000  
F816: 4684  mov.w r6, 2h(r4)
F818: 0002  
F81A: 4036  mov.w #5h, r6
F81C: 0005  
F81E: 3FFF  jmp F81E

This saves the calculated 38 to the stack and keeps 5 in TOS as expected.

Does constant folding also work in loops? This code sums all the numbers from 1 to n, where here n is 7:


: sum 0 swap 1+ 1 do i + loop ;
: main 7 sum ;

(setup removed)
F80E: 4307  mov.w #0h, r7
F810: 4318  mov.w #1h, r8
F812: 4239  mov.w #8h, r9
F814: 480A  mov.w r8, r10
F816: 570A  add.w r7, r10
F818: 5318  add.w #1h, r8
F81A: 9908  cmp.w r9, r8
F81C: 4A07  mov.w r10, r7
F81E: 23FA  jnz F814
F820: 8324  sub.w #2h, r4
F822: 4684  mov.w r6, 0h(r4)
F824: 0000  
F826: 4706  mov.w r7, r6
F828: 3FFF  jmp F828

Other than unnecessarily caching r6 on lines F820 - F824, this is close to the version you would write in assembly, which is impressive. It doesn't use the stack at all. Unfortunately, it does not collapse constants in this case, even though there is enough information to calculate the answer at compile time.

The last thing to check is how it handles register allocations in situations where the size of the stack cannot be determined at compile time. 

: main 3 4 P1IN @ 5 = IF 6 THEN * ;

Here, stack starts with [3 4] and depending on the input of P1IN may become [3 4 6]. Therefore, the * instruction can't predict whether it will multiply 3 and 4 or 4 and 6. Here is the assembly:

(setup removed)

F80E: 90B2  cmp.w #5h, &20h
F810: 0005  
F812: 0020  
F814: 4227  mov.w #4h, r7
F816: 4038  mov.w #3h, r8
F818: 0003  
F81A: 2007  jnz F82A
F81C: 8324  sub.w #2h, r4
F81E: 4684  mov.w r6, 0h(r4)
F820: 0000  
F822: 4806  mov.w r8, r6
F824: 4708  mov.w r7, r8
F826: 4037  mov.w #6h, r7
F828: 0006  
F82A: 4309  mov.w #0h, r9
F82C: 1008  rrc.w r8
F82E: 2801  jnc F832
F830: 5709  add.w r7, r9
F832: 5707  add.w r7, r7
F834: 23FB  jnz F82C
F836: 8324  sub.w #2h, r4
F838: 4684  mov.w r6, 0h(r4)
F83A: 0000  
F83C: 4906  mov.w r9, r6
F83E: 3FFF  jmp F83E

Lots of neat things are happening here. P1IN @ 5 = reduces down to just one instruction, which is really cool! The multiplication is done with the Russian peasant algorithm. The main thing I was wondering about is what happens to register assignments when the top of the stack is unpredictable. In this case, register assignments are shifted by one, accomplishing the same function as a stack push. This ends up being slower than the code you would write if you weren't relying on the stack but a lot faster than relying on the stack directly like other Forths do.

All in all I think the optimizations that Mecrisp does are very impressive, and it was interesting to see how some of them work. It gave me a lot of ideas for how I would try to do optimizing if I made a language like Forth for one of my calculators.

No comments:

Post a Comment