A few years ago, I got interested in Forth due to its connection to RPN calculators but lost interest because of the huge hit to performance a stack-based system takes on an architecture like 6502. Recently, my interest was piqued again during a conversation in the #Forth IRC channel on FreeNode about Mecrisp-Across, which uses an ARM-based microcontroller board to produce optimized assembly for the MSP430. Because the compiler tries to keep values in registers instead of the stack, it has the potential to reclaim some of the performance lost in most Forths. Some of these optimizations might be useful for a 6502-based Forth, so I tried to understand how optimizations in Mecrisp work.
The first step was getting Mecrisp running on a PC. The author provides versions that are compiled for ARM with Linux system calls, so you can use QEMU for emulation under Linux. Getting this working on Windows was a real challenge. One Debian image I tried emulating with QEMU crashed the emulator outright. Another image booted into Linux but refused to install QEMU complaining about needing the Linux CD in the drive before it would install any packages. Getting Ubuntu working on VirtualBox under Windows 10 and installing QEMU on it wasn't too tough, but the Mecrisp image wouldn't run until I got the right type of QEMU (there are several). The right combination was VirtualBox, an Ubuntu image from OSboxes, and QEMU-user-static.
Mecrisp-Across has a host mode for compiling and other high level tasks and a target mode for defining Forth words for the MSP430. The documentation shows that you can only compile once after typing in your words and then you have to erase everything before compiling again. Maybe this is reasonable considering how little memory there is to work with. You can save some time by saving all your commands to a file then using cat and a pipe to direct the file contents to QEMU:
cat test.txt | qemu-arm-static mecrisp-stellaris-linux-with-mecrisp-across
The file itself is pretty simple to set up, and you can generate a binary with just a few lines.
new
target
: main 2 2 + ; \ main program
host
$FFFE vector main
crosscompile
disimage
The last line shows a disassembly of the compiled program, which is the really interesting part. The first question is whether the compiler folds constants and if so, if it does it between words. Here is the first program I tried:
: t1 3 + ;
: t0 5 * t1 ;
: main 7 t0 ;
This should calculate 7 * 5 + 3 and leave 38 on the stack. Here is the generated assembly:
F800: 40B2 mov.w #5A80h, &120h
F802: 5A80
F804: 0120
F806: 4031 mov.w #280h, r1
F808: 0280
F80A: 4034 mov.w #260h, r4
F80C: 0260
F80E: 8324 sub.w #2h, r4
F810: 4684 mov.w r6, 0h(r4)
F812: 0000
F814: 4036 mov.w #26h, r6
F816: 0026
F818: 3FFF jmp F818
The green part disables the watchdog timer and sets up the two stacks. F80E - F812 saves r6 and the next line loads r6 with the calculated value 38, so it is folding contants even between words! One thing we can see is that mov.w #260h, r4 and sub.w #2h, r4 could be combined into mov.w #25Eh, r4, although it might not be optimized if the setup code is just boilerplate. Also, the line saving r6 to the stack could be eliminated since r6 is uninitialized at that point.
Does it also fold constants when an if statement is involved?
: t2 dup 38 = if 5 then ;
: t1 3 + t2 ;
: t0 5 * t1 ;
: main 7 t0 ;
(setup removed)
F80E: 8224 sub.w #4h, r4
F810: 40B4 mov.w #26h, 0h(r4)
F812: 0026
F814: 0000
F816: 4684 mov.w r6, 2h(r4)
F818: 0002
F81A: 4036 mov.w #5h, r6
F81C: 0005
F81E: 3FFF jmp F81E
This saves the calculated 38 to the stack and keeps 5 in TOS as expected.
Does constant folding also work in loops? This code sums all the numbers from 1 to n, where here n is 7:
: sum 0 swap 1+ 1 do i + loop ;
: main 7 sum ;
(setup removed)
F80E: 4307 mov.w #0h, r7
F810: 4318 mov.w #1h, r8
F812: 4239 mov.w #8h, r9
F814: 480A mov.w r8, r10
F816: 570A add.w r7, r10
F818: 5318 add.w #1h, r8
F81A: 9908 cmp.w r9, r8
F81C: 4A07 mov.w r10, r7
F81E: 23FA jnz F814
F820: 8324 sub.w #2h, r4
F822: 4684 mov.w r6, 0h(r4)
F824: 0000
F826: 4706 mov.w r7, r6
F828: 3FFF jmp F828
Other than unnecessarily caching r6 on lines F820 - F824, this is close to the version you would write in assembly, which is impressive. It doesn't use the stack at all. Unfortunately, it does not collapse constants in this case, even though there is enough information to calculate the answer at compile time.
The last thing to check is how it handles register allocations in situations where the size of the stack cannot be determined at compile time.
: main 3 4 P1IN @ 5 = IF 6 THEN * ;
Here, stack starts with [3 4] and depending on the input of P1IN may become [3 4 6]. Therefore, the * instruction can't predict whether it will multiply 3 and 4 or 4 and 6. Here is the assembly:
(setup removed)
F80E: 90B2 cmp.w #5h, &20h
F810: 0005
F812: 0020
F814: 4227 mov.w #4h, r7
F816: 4038 mov.w #3h, r8
F818: 0003
F81A: 2007 jnz F82A
F81C: 8324 sub.w #2h, r4
F81E: 4684 mov.w r6, 0h(r4)
F820: 0000
F822: 4806 mov.w r8, r6
F824: 4708 mov.w r7, r8
F826: 4037 mov.w #6h, r7
F828: 0006
F82A: 4309 mov.w #0h, r9
F82C: 1008 rrc.w r8
F82E: 2801 jnc F832
F830: 5709 add.w r7, r9
F832: 5707 add.w r7, r7
F834: 23FB jnz F82C
F836: 8324 sub.w #2h, r4
F838: 4684 mov.w r6, 0h(r4)
F83A: 0000
F83C: 4906 mov.w r9, r6
F83E: 3FFF jmp F83E
Lots of neat things are happening here. P1IN @ 5 = reduces down to just one instruction, which is really cool! The multiplication is done with the Russian peasant algorithm. The main thing I was wondering about is what happens to register assignments when the top of the stack is unpredictable. In this case, register assignments are shifted by one, accomplishing the same function as a stack push. This ends up being slower than the code you would write if you weren't relying on the stack but a lot faster than relying on the stack directly like other Forths do.
All in all I think the optimizations that Mecrisp does are very impressive, and it was interesting to see how some of them work. It gave me a lot of ideas for how I would try to do optimizing if I made a language like Forth for one of my calculators.
All in all I think the optimizations that Mecrisp does are very impressive, and it was interesting to see how some of them work. It gave me a lot of ideas for how I would try to do optimizing if I made a language like Forth for one of my calculators.
No comments:
Post a Comment