Back in March I finished the microcontroller comparison of a few chips I was considering using. The 8051 based chips all fared poorly in BCD calculations, which is the main thing they would be used for in another calculator project. Some other 8051 fans pointed out that tests like mine test the compiler more than the chip and the SDCC compiler does a lot less optimizing than GCC. So, I decided to recode the BCD Add routine from the test in assembly. At the beginning I didn't consider this because I thought I would only ever write C for the 8051, but after writing assembly for the 6502, I was excited to try it for the 8051 too.
For this test I used the MCU 8051 IDE simulator. It is really impressive! Being able to see all the RAM and registers helped a lot. It is similar in some ways to what I hoped to accomplish with my 6502 Virtual Trainer. When I translated the original BCD Add function from C, I left out the parts used for subtraction, since they weren't used in the test. Also, some of the values that don't change inside loops are calculated once before the loop and stored for use later. This makes sense because a good C compiler would do this for you. All values are stored in the internal DATA, although this would be far too small for real calculations. The simulator can also work with SDCC, so I was able to simulate the function in C after commenting out the parts for subtraction.
The assembly took a simulated 440us to execute. The C version took between 1365 and 1732us depending on memory model and whether the BCD data was stored in DATA or XDATA. Curiously, storing it in XDATA was always faster. The medium memory model was faster than the large memory model. It seems then that hand-coding the routine in assembly is 3-4 times faster. To be fair, the routines for other architectures might also show a speed up if they were rewritten. Compared to the original numbers of the test, a 3x speedup would make the DS89C450 faster than the MSP430 for this routine. The difference might actually be a little bigger since this test used an upgraded version of SDCC that might optimize better. With that in mind, an 8051-based calculator with external bus (coded in assembly) would outperform an MSP430 with SPI SRAM. That is a bit of a surprise after the results of the first test.
The LPC1114 could also be sped up a little. In the original test the chip was configured to run at 50MHz. Because the flash runs at a maximum 20MHz, 3 wait states are required when executing from flash. However, 3 wait states is a little wasteful since 2.5 would be enough. Running the chip at 40MHz would only require 2 wait states, and would let the flash run at full speed. I'm not sure this would give a whole 16% speedup but I imagine it would offer a little bit of a boost.
For this test I used the MCU 8051 IDE simulator. It is really impressive! Being able to see all the RAM and registers helped a lot. It is similar in some ways to what I hoped to accomplish with my 6502 Virtual Trainer. When I translated the original BCD Add function from C, I left out the parts used for subtraction, since they weren't used in the test. Also, some of the values that don't change inside loops are calculated once before the loop and stored for use later. This makes sense because a good C compiler would do this for you. All values are stored in the internal DATA, although this would be far too small for real calculations. The simulator can also work with SDCC, so I was able to simulate the function in C after commenting out the parts for subtraction.
The assembly took a simulated 440us to execute. The C version took between 1365 and 1732us depending on memory model and whether the BCD data was stored in DATA or XDATA. Curiously, storing it in XDATA was always faster. The medium memory model was faster than the large memory model. It seems then that hand-coding the routine in assembly is 3-4 times faster. To be fair, the routines for other architectures might also show a speed up if they were rewritten. Compared to the original numbers of the test, a 3x speedup would make the DS89C450 faster than the MSP430 for this routine. The difference might actually be a little bigger since this test used an upgraded version of SDCC that might optimize better. With that in mind, an 8051-based calculator with external bus (coded in assembly) would outperform an MSP430 with SPI SRAM. That is a bit of a surprise after the results of the first test.
The LPC1114 could also be sped up a little. In the original test the chip was configured to run at 50MHz. Because the flash runs at a maximum 20MHz, 3 wait states are required when executing from flash. However, 3 wait states is a little wasteful since 2.5 would be enough. Running the chip at 40MHz would only require 2 wait states, and would let the flash run at full speed. I'm not sure this would give a whole 16% speedup but I imagine it would offer a little bit of a boost.