Tuesday, November 21, 2017

BCD Addition Speed Test: 6502 vs 8051

12345.6789 + 9999.99999 packed BCD on 8051
A few years ago I decided to revisit the microcontroller comparison I had done to see how much of a speedup I could get by recoding my BCD addition routine in 8051 assembly. At the time I was surprised that it was 3-4 times faster than the C version. That original BCD routine, which I used for my RPN Scientific Calculator, uses unpacked BCD characters arranged with the most significant byte first. For the calculator I want to build with a 6502, I wrote a new routine that stores the least significant byte first in packed BCD. This takes advantage of the 6502's decimal mode to quickly add packed BCD bytes and avoids having to shift all the bytes when an extra byte has to be added for the final carry. It turns out that the 8051 has an instruction, DA, that corrects the binary result of a BCD addition back into BCD, so it is easy to do packed BCD on it as well. To compare the speed of the two chips for use in a calculator, I translated the 6502 packed BCD routine to 8051 assembly and ran both versions in their respective simulators.
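
To make the packed format concrete, here is a minimal Python model of packed BCD addition with the least significant byte first. It is only a sketch and not the calculator's actual routine: the function names are hypothetical, the operands are assumed to be the same length with their decimal points already aligned, and sign and exponent handling are left out. The per-byte step mimics what the 6502's decimal mode, or a binary add followed by DA on the 8051, does in hardware.

```python
def add_packed_bcd_byte(a, b, carry_in=0):
    """Add two packed BCD bytes (two digits each) plus a carry.

    Returns (result_byte, carry_out). Hypothetical helper that models
    a binary add followed by a decimal adjust.
    """
    lo = (a & 0x0F) + (b & 0x0F) + carry_in
    half_carry = 0
    if lo > 9:
        lo += 6          # skip the six unused codes A-F
        half_carry = 1
    hi = (a >> 4) + (b >> 4) + half_carry
    carry_out = 0
    if hi > 9:
        hi += 6
        carry_out = 1
    return ((hi & 0x0F) << 4) | (lo & 0x0F), carry_out


def add_packed_bcd(x, y):
    """Add two equal-length packed BCD numbers, least significant byte first."""
    result = []
    carry = 0
    for a, b in zip(x, y):
        byte, carry = add_packed_bcd_byte(a, b, carry)
        result.append(byte)
    if carry:
        # With the least significant byte first, the final carry is simply
        # appended; nothing already stored has to be shifted.
        result.append(0x01)
    return result


# Assumed fixed-point encoding for illustration: both operands scaled by 10^5,
# so 12345.67890 -> 1234567890 and 9999.99999 -> 0999999999.
a = [0x90, 0x78, 0x56, 0x34, 0x12]
b = [0x99, 0x99, 0x99, 0x99, 0x09]
total = add_packed_bcd(a, b)
print([f"{byte:02X}" for byte in total])  # ['89', '78', '56', '34', '22']
```

Read from the last byte to the first, the result is 2234567889, or 22345.67889 once the assumed scaling is undone.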

Although it's not entirely scientific, I chose a single pair of numbers to run the routines on: 12345.6789 + 9999.99999. The 6502 version took 708 cycles, while the 8051 took only 493. This was a big shock since the 8051 needs so many instructions to address external memory. The 6502 has several indirect and indexed addressing modes, but on the 8051 the only way to access external RAM is the movx instruction, which offers plain register-indirect addressing (through DPTR, or through R0/R1 with an 8-bit address) and no indexing. This means any indexing or pointer operations have to be performed manually using the DPTR register, which is tedious and takes up a lot of program space. I think this design makes it obvious that the chip is intended for microcontroller applications, rather than general processor uses like the 6502. When I converted my first BCD addition routine from C to 8051 assembly, I kept all the data in the first 256 bytes of RAM. Using this data space is much more convenient but unrealistic for the more complex calculator I want to build. I did not realize at the time how difficult it is to work with data types like my BCD type without convenient indirection and indexing. My solution was to make a few macros that replace the addressing modes of the 6502. For example, compare the original version in 6502 assembly and how I replaced it for the 8051:

Line  6502               8051
001                      IndexDPTR0 MACRO DPTR_copy, Index
002                      clr
003                      mov A, DPTR_copy
004                      add A, Index
005                      mov DP0L, A
006                      mov A, LOW(DPTR_copy)+1
007                      addc A, #0
008                      mov DP0H, A
009                      ENDM
010   ...                ...
011                      push A
012   LDY #Offset        IndexDPTR0 Address, #Offset
013                      pop A
014   STA (Address), Y   movx @DPTR, A

As you can see, it takes a lot of code to replace the 6502 addressing modes, which will eat up a lot of the code space. Fortunately, there will be a lot more than 64k of code space with bank switching if I use this chip, so it shouldn't matter too much. The 6502 LDY instruction on line 12 takes 2 cycles and the indirect, indexed store on line 14 takes 6 cycles for a total of 8 cycles. For the 8051, the macro body takes 1 cycle per instruction for a total of 7 cycles, and the movx on line 14 takes 2 more for a total of only 9, not counting the push and pop around the macro. Even though the 8051 version eats up 13 bytes for the macro and 1 for the movx, compared to 2 for LDY and 2 for STA on the 6502, it is not much slower. Another caveat is that the macro trashes the A register, so anything important has to be saved and restored, or calculated after the macro is called.
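
The address calculation that the macro performs can be modeled in a few lines of Python. This is a rough sketch under stated assumptions: `index_dptr` is a hypothetical name, and the pointer copy is assumed to be stored low byte first in internal RAM, as the `DPTR_copy` and `LOW(DPTR_copy)+1` operands suggest. It shows the 8-bit add, the carry into the high byte, and the resulting DP0L/DP0H pair, which together do the job of the 6502's `(Address), Y` mode.

```python
def index_dptr(copy_lo, copy_hi, index):
    """Model of the IndexDPTR0 macro: 16-bit pointer plus 8-bit index.

    copy_lo/copy_hi stand in for the two bytes of DPTR_copy.
    Returns (DP0L, DP0H).
    """
    total = copy_lo + index            # add A, Index
    dp0l = total & 0xFF                # mov DP0L, A
    carry = total >> 8                 # carry flag left by the add
    dp0h = (copy_hi + carry) & 0xFF    # addc A, #0 / mov DP0H, A
    return dp0l, dp0h


def effective_address(copy_lo, copy_hi, index):
    """Combine the two data pointer bytes into the address movx will use."""
    dp0l, dp0h = index_dptr(copy_lo, copy_hi, index)
    return (dp0h << 8) | dp0l


# A pointer of 0x12F0 indexed by 0x20 crosses a page boundary,
# so the carry has to ripple into the high byte.
print(hex(effective_address(0xF0, 0x12, 0x20)))  # 0x1310
```

The carry into the high byte is the part that the 6502 handles for free in its indirect indexed mode and that the 8051 needs the extra addc for.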

The 8051 model I worked with before, the DS89C450, runs at 33MHz, although I found a few discussions pointing out that this family of chips was rated for 50MHz when it first went on sale. Apparently, the chip can run that fast from XRAM, but not from internal flash. My plan would be to run from flash at a lower speed long enough to copy code into external RAM, then switch to a higher clock rate. This chip also claims to be single-cycle, which means one instruction cycle takes only one clock cycle, instead of 12 like most 8051s. Hopefully it is truly single-cycle and not pipelined, since a pipeline loses its advantage whenever a jump empties it, although I could not find information on that in the datasheet. The chip also has other neat features like dual data pointers, which I used in my routine and the IDE 8051 simulator supports, automatic data pointer increment, and automatic data pointer swap. If I use these features, I think I could make the routine even faster.

At 708 cycles, a modern 6502 running at 14MHz could do about 20,000 of the BCD additions I tested per second. However, I don't think the first version I make will be anywhere near that speed, since it won't be an SMD board. If I use one of the decoding methods I posted about before, I hope to achieve at least 5-6MHz, which would be less than 10,000 of those calculations per second. One thing that makes the routine a little slower is using the X register to index into a pseudo stack in zero page for local variables in the function. This adds one cycle compared to accessing zero page directly but allows me to efficiently manage function memory. The 8051 has four sets of eight registers, which seems like a much better solution to that problem. With the DS89C450, I could achieve 60,000-100,000 additions per second, depending on how fast I can run. Even if the single-cycle mode is pipelined, it seems that the 8051 would be much faster than a 6502 for calculating. The 8051 has other advantages such as more address space, since it has separate code and data spaces, built-in GPIO, built-in flash for easy bootstrapping, and timers. The major downside in my opinion is the lack of useful addressing modes for the external RAM. If I use this chip for a calculator, I will need to rely on a lot of macros for pointer calculations, as the 6502 definitely has an edge in this area.
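
The throughput figures above are just clock rate divided by cycle count. The short Python sketch below reproduces them; the function name is made up for illustration, and it assumes that each simulator cycle costs exactly one clock on both chips, which for the DS89C450 depends on the single-cycle claim holding up.

```python
CYCLES_6502 = 708   # simulator count for the test addition on the 6502
CYCLES_8051 = 493   # simulator count for the same addition on the 8051


def additions_per_second(clock_hz, cycles_per_addition):
    """Whole additions per second, assuming one clock per counted cycle."""
    return clock_hz // cycles_per_addition


for label, clock_hz, cycles in [
    ("6502 at 14MHz", 14_000_000, CYCLES_6502),
    ("6502 at 6MHz", 6_000_000, CYCLES_6502),
    ("DS89C450 at 33MHz", 33_000_000, CYCLES_8051),
    ("DS89C450 at 50MHz", 50_000_000, CYCLES_8051),
]:
    print(label, additions_per_second(clock_hz, cycles))
```

This gives roughly 19,800 and 8,500 additions per second for the two 6502 clock rates, and roughly 66,900 and 101,400 for the DS89C450.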

After working with the 6502, here are some 8051 idiosyncrasies I noticed:
  • Unlike STA on the 6502, mov can transfer data between registers and memory or memory and memory without affecting the accumulator. Strangely, it does not allow transfers between registers.
  • add allows adds without carry, which would be nice on the 6502.
  • mov allows the DPTR to be loaded with one 16-bit value, instead of two reads and writes like on the 6502.
  • jnz only tests A, whereas BNE and BEQ on the 6502 test the zero flag, which operations on X and Y also set.
  • push works with internal data addresses but not with register names.
  • xch is really useful and saves a few cycles.
  • swap and xchd allow you to swap nibbles, which is handy for BCD calculations.
  • Rather than a separate 256-byte block for the stack like on the 6502, the 8051 stack has to fit somewhere in the first 128 bytes of RAM. The default stack, at least in the simulator, is at 0x08, which seems like a strange choice, since that is where the second register set resides.
  • inc works on the accumulator on the 8051, and I wish it did on the 6502.
  • clr also works on the accumulator, which is shorter and faster than LDA #0 on the 6502.
  • movc using @A+DPTR is a decent way to index into code memory. Unfortunately, that doesn't work with movx.
  • cjne can't compare the accumulator and registers, which is an odd limitation. It also sets the C flag after it compares. This is a useful way to tell greater than from less than, but it means you have to push and pop the PSW (program status word) to preserve C for multibyte additions.
  • djnz does not work on A, which is another weird limitation.
