12345.6789 + 9999.99999 packed BCD on 8051
Although it's not entirely scientific, I chose two numbers to run the routines on: 12345.6789 + 9999.99999. The 6502 version took 708 cycles, while the 8051 took only 493. This was a big shock, since the 8051 needs so many instructions to address external memory. Unlike the 6502, which has several indirect and indexed addressing modes, the 8051 can only reach external RAM through the movx instruction, and @DPTR is its only addressing mode that covers the full 16-bit address space. This means any indexing or pointer operations have to be performed manually using the DPTR register, which is tedious and takes up a lot of program space. I think this design makes it obvious that the chip is intended for microcontroller applications rather than general processor uses like the 6502.
When I converted my first BCD addition routine from C to 8051 assembly, I kept all the data in the first 256 bytes of RAM. Using this data space is much more convenient but unrealistic for the more complex calculator I want to build. I did not realize at the time how difficult it is to work with data types like my BCD type without indirection and indexing. My solution was to make a few macros that replace the addressing modes of the 6502. For example, compare the original version in 6502 assembly with how I replaced it for the 8051:
Line | 6502 | 8051 |
---|---|---|
001 | | IndexDPTR0 MACRO DPTR_copy, Index |
002 | | clr C |
003 | | mov A, DPTR_copy |
004 | | add A, Index |
005 | | mov DP0L, A |
006 | | mov A, LOW(DPTR_copy)+1 |
007 | | addc A, #0 |
008 | | mov DP0H, A |
009 | | ENDM |
010 | ... | ... |
011 | | push A |
012 | LDY #Offset | IndexDPTR0 Address, #Offset |
013 | | pop A |
014 | STA (Address), Y | movx @DPTR, A |
As you can see, it takes a lot of code to replace the 6502 addressing modes, which will eat up a lot of code space. Fortunately, there will be a lot more than 64k of code space with bank switching if I use this chip, so it shouldn't matter too much. The 6502 LDY instruction on line 12 takes 2 cycles and the indirect indexed store on line 14 takes 6 cycles, for a total of 8 cycles. For the 8051, the macro body takes 1 cycle per instruction for a total of 7 cycles, and the movx on line 14 takes 2 more, for a total of only 9 (not counting the push and pop around the macro call). Even though the 8051 version eats up 13 bytes for the macro and 1 for the movx, compared to 2 for LDY and 2 for STA on the 6502, it is not much slower. Another caveat is that the macro trashes the A register, so anything important has to be saved and restored or calculated after the macro is called.
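To make the pointer arithmetic in the macro easier to follow, here is a minimal Python sketch of what it computes: an 8-bit offset added to a 16-bit pointer one byte at a time, with the carry from the low byte propagated into the high byte. The function name `index_dptr` and its arguments are hypothetical stand-ins for the macro's operands, and the sketch assumes the pointer copy is stored low byte first. It models the arithmetic only, not the timing.

```python
def index_dptr(ptr_lo, ptr_hi, index):
    """Model the IndexDPTR0 arithmetic: 16-bit pointer + 8-bit index.

    Returns (dpl, dph), the bytes the macro would leave in DP0L and DP0H.
    Assumes the pointer copy is stored low byte first (an assumption).
    """
    total = ptr_lo + index          # add A, Index
    dpl = total & 0xFF              # mov DP0L, A
    carry = total >> 8              # C flag left by the add (0 or 1)
    dph = (ptr_hi + carry) & 0xFF   # addc A, #0 followed by mov DP0H, A
    return dpl, dph


# Pointer 0x12F0 plus offset 0x20 crosses a 256-byte boundary,
# so the carry has to ripple into the high byte.
dpl, dph = index_dptr(0xF0, 0x12, 0x20)
print(hex((dph << 8) | dpl))  # 0x1310
```

The 6502 does the same addition in hardware as part of the (Address), Y addressing mode, which is why it needs no extra instructions there.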
The 8051 model I worked with before, the DS89C450, runs at 33MHz, although I found a few discussions pointing out that this family of chips was rated for 50MHz when it first went on sale. Apparently, the chip can run that fast from XRAM, but not from internal flash. My plan would be to run from flash at half speed long enough to copy code into external RAM, then switch to a higher clock rate. This chip also claims to be single-cycle, which means one instruction cycle takes only one clock cycle, instead of 12 like most 8051s. Hopefully it is truly single-cycle and not pipelined, since a pipeline loses its advantage when the code jumps and the pipeline is emptied, although I could not find information on that in the datasheet. The chip also has other neat features like dual data pointers, which I used in my routine and which the IDE 8051 simulator supports, automatic data pointer increment, and automatic data pointer swap. If I use these functions, I think I could make the routine even faster.
At 708 cycles, a modern 6502 running at 14MHz could do about 20,000 of the BCD additions I tested per second. However, I don't think the first version I make will be anywhere near that speed, since it won't be an SMD board. If I use one of the decoding methods I posted about before, I hope to achieve at least 5-6MHz, which would be less than 10,000 of those calculations per second. One thing that makes the routine a little slower is using the X register to index into a pseudo stack in zero page for local variables in the function. This adds one cycle more than accessing zero page directly but allows me to efficiently manage function memory. The 8051 has four sets of eight registers, which seems like a much better solution to that problem. With the DS89C450, I could achieve 60,000-100,000 additions per second, depending on how fast I can run. Even if the single-cycle mode is pipelined, it seems that the 8051 would be much faster than a 6502 for calculating. The 8051 has other advantages, such as more address space (since it has separate code and data spaces), built-in GPIO, built-in flash for easy bootstrapping, and timers. The major downside in my opinion is the lack of useful addressing modes for the external RAM. If I use this chip for a calculator, I will need to rely on a lot of macros for pointer calculations, as the 6502 definitely has an edge in this area.
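The throughput estimates above are just clock rate divided by cycles per addition. As a rough check, the short Python sketch below redoes that arithmetic with the cycle counts measured earlier. The helper name `additions_per_second` is hypothetical, and the 8051 rows assume one clock per counted cycle, which only holds if the chip really is single-cycle.

```python
def additions_per_second(clock_hz, cycles_per_addition):
    """Whole BCD additions that fit into one second at the given clock."""
    return clock_hz // cycles_per_addition


CYCLES_6502 = 708  # measured for 12345.6789 + 9999.99999
CYCLES_8051 = 493  # same inputs; assumes one clock per cycle

scenarios = [
    ("6502 @ 14 MHz", 14_000_000, CYCLES_6502),
    ("6502 @ 6 MHz", 6_000_000, CYCLES_6502),
    ("8051 @ 33 MHz", 33_000_000, CYCLES_8051),
    ("8051 @ 50 MHz", 50_000_000, CYCLES_8051),
]

for label, clock_hz, cycles in scenarios:
    rate = additions_per_second(clock_hz, cycles)
    print(f"{label}: {rate:,} additions/s")
```

With these inputs the results come out at roughly 19,800 and 8,500 additions per second for the 6502, and roughly 66,900 and 101,400 for the 8051, which matches the ranges quoted above.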
After working with the 6502, here are some 8051 idiosyncrasies I noticed:
- Unlike STA on the 6502, mov can transfer data between registers and memory or memory and memory without affecting the accumulator. Strangely, it does not allow transfers between registers.
- add allows adds without carry, which would be nice on the 6502.
- mov allows the DPTR to be loaded with one 16-bit value, instead of two reads and writes like on the 6502.
- jnz only works on A, whereas BEQ on the 6502 tests the Z flag, so it also works after operations on X and Y.
- push works with internal data addresses but not with register names.
- xch is really useful and saves a few cycles.
- swap and xchd allow you to swap nibbles, which is handy for BCD calculations.
- Rather than a separate 256-byte block for the stack like on the 6502, the 8051 stack has to fit somewhere in the first 128 bytes of RAM. The default stack, at least in the simulator, is at 0x08, which seems like a strange choice, since that is where the second register set resides.
- inc works on the accumulator on the 8051, and I wish it did on the 6502.
- clr also works on the accumulator, which is shorter and faster than LDA #0 on the 6502.
- movc using @A+DPTR is a decent way to index into code memory. Unfortunately, that doesn't work with movx.
- cjne can't compare the accumulator and registers, which is an odd limitation. It also sets the C flag after it compares. This is a useful way to tell greater than from less than, but it means you have to push and pop the PSW status word to preserve C for multibyte additions.
- djnz does not work on A, which is another weird limitation.
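Since several of these notes come down to nibbles and carries, here is a small Python sketch of packed BCD addition done one byte at a time with a carry between bytes. It is not the routine from this post: the function names are hypothetical, and storing the digits least significant byte first with the decimal points already aligned is an assumption made purely for illustration.

```python
def swap_nibbles(byte):
    """Exchange the high and low nibbles of a byte, like swap A."""
    return ((byte << 4) | (byte >> 4)) & 0xFF


def bcd_add_byte(a, b, carry_in=0):
    """Add two packed BCD bytes (two digits each) plus an incoming carry.

    Returns (sum_byte, carry_out).
    """
    lo = (a & 0x0F) + (b & 0x0F) + carry_in
    half_carry = 0
    if lo > 9:              # low digit overflowed past 9
        lo -= 10
        half_carry = 1
    hi = (a >> 4) + (b >> 4) + half_carry
    carry_out = 0
    if hi > 9:              # high digit overflowed past 9
        hi -= 10
        carry_out = 1
    return (hi << 4) | lo, carry_out


def bcd_add(a_bytes, b_bytes):
    """Add two equal-length packed BCD numbers, least significant byte first."""
    result = []
    carry = 0
    for a, b in zip(a_bytes, b_bytes):
        total, carry = bcd_add_byte(a, b, carry)
        result.append(total)
    return result, carry


# 12345.67890 + 09999.99999 with the decimal points aligned (assumed layout).
a = [0x90, 0x78, 0x56, 0x34, 0x12]
b = [0x99, 0x99, 0x99, 0x99, 0x09]
digits, carry = bcd_add(a, b)
print([hex(d) for d in digits], carry)
```

Read from the last byte to the first, the result is 22 34 56 78 89 with no final carry, that is 22345.67889. The carry handed from one byte to the next is the value that has to survive between additions, which is why an instruction that overwrites C in the middle of the loop is a nuisance.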