As my last post explains, my goal for 2023 is to finish enough projects to give a presentation at HHC 2023. One of the biggest jobs will be finishing the 6507 Graphing Calculator I've been working on the past few years. The latest stage of the project was rewriting part of the firmware in Forth to save space, which turned out to be a failure.
When I last worked on this project, the firmware had reached 9.3K which well exceeds the calculator's 8K EEPROM. This wasn't so concerning since it seemed plausible at the time to implement a tiny Forth core and shrink the firmware by rewriting parts of it in Forth. Also, some parts of the existing firmware, which is all written in assembly, could be improved to save space. With this plan in mind, I started implementing a preprocessor in Python that scans all of the project's assembly files for Forth code and converts it into bytes to embed in the assembly source code. The Forth core uses token threading so each Forth word in the source produces just one byte rather than two as in direct and indirect threading or three with subroutine threading. The Forth system is initiated with a BRK instruction which jumps to the software interrupt handler and begins interpreting the bytes following the BRK. This saves two bytes for every invocation of the Forth system over using JMP or JSR. One of the tokens in the Forth system is assigned the same number as the BRK op code and increments a counter when executed. A corresponding token at the end of each function decrements the counter and switches from Forth back to assembly when the counter reaches zero. With this setup, the program can jump from assembly or Forth to a function written in Forth and continue executing seamlessly.
Forth code embedded in 6502 assembly |
EXTERN
SCREEN_ADDRESS+CHAR_HEIGHT*6.5*SCREEN_WIDTH AS STACK_LINE_Y
END
This functionality was extremely useful, especially in situations with forward references where the typical Forth words [ and ] wouldn't work.
Implementing this Forth system let me try out some ideas not usually found in traditional Forths:
- Constants can be marked as only 8 bits wide with CONST8 which saves a byte in the source.
- DO loops use an 8-bit counter by default which really speeds things up compared to a 16-bit counter.
- 16-bit constants are stored in a table managed by the preprocessor and referred to in the source by an 8-bit ID. This takes one extra byte compared to storing the 16-bit constant directly in the source instead of the table but breaks even if the constant is used at least twice and saves an extra byte every time the constant is used after that.
Forth VM debug output (shortened) |
The debug information also has a section for "Floating Point Ops" as seen above. This was an idea I experimented with for using bytecode for floating point registers too but didn't finish since there wasn't enough space.
Rather than implementing all of the Forth words like DUP and OVER that might be useful eventually, I started by rewriting the functions for graphics output in Forth and only defining new words as necessary to keep the Forth core as small as possible. Graphics output seemed like a natural place to start since Forth sacrifices a lot of speed which may not be very noticeable since the screen doesn't update often. The minimal Forth core for this quickly grew to 1K, which was much larger than expected. It seems the 6502 is not especially size-efficient compared to similar architectures. This paper on code density, for example, shows that the Z80, which was contemporary with the 6502, is generally much denser. People often cite the 6502 as a forerunner to RISC-style architectures which never made much sense to me with it's small number of registers, variable instruction size, and many addressing modes. However, it makes more sense when you consider that some operations require a lot of instructions.
The recoded Forth functions for the graphics output were generally 50-60% the size of the assembly equivalent. This provides an easy metric for estimating how the project might proceed. Adding the Forth core brought the firmware size to about 10.3K. Assuming the best case of 50% compression for all the Forth code, 4.5K of assembly would need to be recoded into Forth to get the existing firmware to fit into the 8K EEPROM. This would leave less than 5K for the parts that need to be in assembly. Unfortunately, the Forth code runs 30-40 times slower than the equivalent assembly which is too slow for graphics output and for most other things as well. The floating point routines, for example, need to be fast and should be in assembly. On top of this, extra functionality needs to be added like garbage collection, graphing, and more trig functions which would eat into the remaining space. There's just no way to fit everything I planned on doing into 8K of 6502 code.
There are a few ways to proceed if recoding the firmware in Forth won't work. One idea is to reduce the functionality to fit the available 8K of memory. This idea appeals to me since it would keep more to the spirit of the original competition. Support for strings and pointers would be easy to remove, for example, to save space. On the other hand, it would be a shame to remove functionality from the firmware that is already running on the emulator on my website. The other alternative is to increase the size of the EEPROM. This will let me finish the project writing everything in assembly. Since there aren't any 16K chips readily available, it will use two 8K chips or one 32K chip. In either case, I'll need to make a new circuit board.
After it became clear that the Forth system wouldn't work, I reverted everything back to the pure assembly version. The project has relied on my 6502 Optimizing Assembler from the beginning for passing arguments and allocating local variables for functions in zero page. The current incarnation of the optimizer uses NASM for its macro capabilities to process the 6502 assembly then uses the result to assign memory addresses. The downside of this is that the NASM macros are long and complicated which leads to enormous assembly listings and longer assembly times. Since the optimizer is only assigning zero page addresses for this project, I made a small Python script to replace the optimizer. It's less than 600 lines of code but enough to eliminate the reliance on NASM and speed the assembly times up to less than a second. The assembly listings are also normal size now since the new optimizer only needs a handful of short macros. After this was done, I moved the project from Windows to Linux where I prefer to program these days.
No comments:
Post a Comment