12345.6789 + 9999.99999 packed BCD on 8051 |
▼
Tuesday, November 21, 2017
BCD Addition Speed Test: 6502 vs 8051
Wednesday, November 15, 2017
6502 BCD Multiplication
In preparation for the 6502-based calculator I am planning on building, I started thinking about the best way to do BCD multiplication. One of the reasons I was interested in working with the 6502 is its built in BCD mode. This works great for add and subtract but the chip does not have any sort of multiply or divide, BCD or otherwise. (Some of the 6800 family I was researching before do have hardware multiply, as does the 8051.) There are several routines that can make BCD multiplication faster, so I tried some of them on the 6502 to see which one works best.
To test these I used the Kowalski 6502 Simulator, which is neat because you can assemble in the simulator and then single step or run the program and see which line is being executed. Another convenient thing is that an input/output window is mapped into memory starting at at $E000. I also like the help window which automatically gives you information on op codes when you type them. On the other hand, the included macro system is extremely primitive and you can't do much with it, especially compared to the excellent macro system in CA65. It has a lot of other shortcomings and is fairly buggy, but I don't want to complain since it is free software. After I got frustrated at the lack of basic macro functionality, I tried assembling with CA65 and loading the binary into the simulator. The diassembler seems to work well but of course there is no way for the simulator to know label or function names. For these multiplication tests I decided to just make do with the simulator since I don't need much macro functionality. In the future, though, I will go back to using the 6502 trainer I made before.
To run the tests I made a loop that multiplies all combinations of BCD numbers $00 x $00 to $99 x $99. The Kowalski simulator has a function to count cycles, so I used that to find the time of an empty loop. For each test I recorded the cycles it took, subtracted the loop overhead time, and divided by 10,000 to get average cycles per calculation. After I was done, I went back and used look up tables for operations like shifting by 4 and halving numbers to speed things up a little. I also tried checking for 0 and 1 as arguments to skip the rest of the calculation, but this was not faster. In the table below, the table size column takes into account all the space the tables take up, including unused padding bytes inside the table. Here are the results in the order I did the tests:
Descriptions of methods below:
To test these I used the Kowalski 6502 Simulator, which is neat because you can assemble in the simulator and then single step or run the program and see which line is being executed. Another convenient thing is that an input/output window is mapped into memory starting at at $E000. I also like the help window which automatically gives you information on op codes when you type them. On the other hand, the included macro system is extremely primitive and you can't do much with it, especially compared to the excellent macro system in CA65. It has a lot of other shortcomings and is fairly buggy, but I don't want to complain since it is free software. After I got frustrated at the lack of basic macro functionality, I tried assembling with CA65 and loading the binary into the simulator. The diassembler seems to work well but of course there is no way for the simulator to know label or function names. For these multiplication tests I decided to just make do with the simulator since I don't need much macro functionality. In the future, though, I will go back to using the 6502 trainer I made before.
To run the tests I made a loop that multiplies all combinations of BCD numbers $00 x $00 to $99 x $99. The Kowalski simulator has a function to count cycles, so I used that to find the time of an empty loop. For each test I recorded the cycles it took, subtracted the loop overhead time, and divided by 10,000 to get average cycles per calculation. After I was done, I went back and used look up tables for operations like shifting by 4 and halving numbers to speed things up a little. I also tried checking for 0 and 1 as arguments to skip the rest of the calculation, but this was not faster. In the table below, the table size column takes into account all the space the tables take up, including unused padding bytes inside the table. Here are the results in the order I did the tests:
Method | Cycle average | Code size | Table size | |
1 | Repeated additions | 1,596.23 | 34 | 0 |
2 | Repeated additions, minimize loops | 1,236.66 | 48 | 0 |
3 | Shift/add | 284.80 | 67 | 0 |
4 | 4x8 lookup table | 101.00 | 65 | 10k |
5 | 4x8 lookup table, byte aligned | 97.00 | 61 | 14.8k |
6 | Russian peasant algorithm | 267.79 | 66 | 0 |
7 | Quarter square table (half table) | 115.33 | 85 | 816 |
8 | Quarter square table (full table) | 118.00 | 78 | 1,632 |
9 | Quarter square table (BCD to bin to BCD) | 86.00 | 51 | 1,816 |
10 | Quarter square table (BCD to bin only) | 60.00 | 34 | 792 |
11 | Half squares | 113.60 | 126 | 200 |
12 | Simulated ROM lookup | 31.00 | 20 | 46k |
Descriptions of methods below:
Friday, November 3, 2017
6502 Address Decoding
For the 6502 graphing calculator I started working on two years ago I wanted to use an ATF1508 CPLD to do the address decoding. In another post, I wrote about how difficult it was to get this chip going. For one, the program I found to generate SVF files for it is 16-bit only. At the time I was running it in a virtual machine, but I don't have my Windows XP disk any more. If I keep using the homemade setup for it I had before, instead of buying a programmer (which is expensive), I think I might be able to run the program in a DOS emulator without Windows.
Another way I was thinking about doing the decoding is with a 32k 12ns SRAM. For each address, the SRAM would act as a lookup table for the chip selects and other signals that the CPLD normaly does. This would not be as fast as the 7.5ns of the ATF1508, but it would still be very fast. One way would be to hold the 6502 in reset and cycle a counter through all 16-bits of the address space copying information from an EEPROM to the SRAM. After a certain amount of time, a timer would fire to disable that EEPROM and the counter and enable the SRAM, EEPROM, and 6502 of the main system. I'm not sure I got all of the signals exactly right but I made this schematic to illustrate approximately what I was thinking. The disadvantages of this setup is that the extra SRAM consumes 160ma, which would be a lot for the battery-based calculator I want to build, and the extra space the counters and second EEPROM would take.
Another way I thought about was disabling the main SRAM on startup but directing all reads to the EEPROM and all writes to the address decoding SRAM. This would let me write the whole lookup table into that SRAM with the 6502 . A transparent buffer would separate that SRAM from the data bus and prevent it from being written to further after a timer fires. If I got everything right, this would require more glue logic, but it would get rid of the counters and extra EEPROM of the last design.
A third way I thought of is to skip writing the memory decoder SRAM and use an EEPROM. Unfortunately, that would need 45ns or so to do the decoding, which is a lot slower than a CPLD. One alternative I found is 8k 25ns NVRAM, which stores its value permanently in an EEPROM but loads its content into SRAM when it turns on. This would give me 8 byte resolution, so I would probably lose less than 200 bytes of address space for peripherals. Also, if I drive the SPI clock with the NVRAM then I could set it low for one 8 byte stretch and fill that stretch with NOP instructions. Then to pull the clock low for a given number of cycles, I would just have to jump to the corresponding NOP in that stretch of memory.
On the 6502 forum I found a really great page on 6502 bus timing. If I understand the timing correctly, the 6502 needs up to 30ns to output its address and then the address decoding logic can go to work (either 12 or 25ns in this case). After that, the 74HC670 I have in the diagram will need 16ns up to nearly 40ns in the worst case to drive the high bits of the 512k SRAM. If I can get away with only two windows into the SRAM then it would only be about 6ns max if I replace the 74HC670 with a buffer and 74AS244. In order to avoid accidentally writing to wrong addresses, those delays will have to propagate before the clock goes high and the glue logic brings the WE\ of the SRAM low. This would allow me to run between 5 and 10MHz, depending on the combination. This is a lot less than the 14 MHz maximum the W65C02 I am using is rated for but it might work until I get the kinks out of working with the ATF1508.
6502 address decoder with SRAM |
Another way I thought about was disabling the main SRAM on startup but directing all reads to the EEPROM and all writes to the address decoding SRAM. This would let me write the whole lookup table into that SRAM with the 6502 . A transparent buffer would separate that SRAM from the data bus and prevent it from being written to further after a timer fires. If I got everything right, this would require more glue logic, but it would get rid of the counters and extra EEPROM of the last design.
A third way I thought of is to skip writing the memory decoder SRAM and use an EEPROM. Unfortunately, that would need 45ns or so to do the decoding, which is a lot slower than a CPLD. One alternative I found is 8k 25ns NVRAM, which stores its value permanently in an EEPROM but loads its content into SRAM when it turns on. This would give me 8 byte resolution, so I would probably lose less than 200 bytes of address space for peripherals. Also, if I drive the SPI clock with the NVRAM then I could set it low for one 8 byte stretch and fill that stretch with NOP instructions. Then to pull the clock low for a given number of cycles, I would just have to jump to the corresponding NOP in that stretch of memory.
On the 6502 forum I found a really great page on 6502 bus timing. If I understand the timing correctly, the 6502 needs up to 30ns to output its address and then the address decoding logic can go to work (either 12 or 25ns in this case). After that, the 74HC670 I have in the diagram will need 16ns up to nearly 40ns in the worst case to drive the high bits of the 512k SRAM. If I can get away with only two windows into the SRAM then it would only be about 6ns max if I replace the 74HC670 with a buffer and 74AS244. In order to avoid accidentally writing to wrong addresses, those delays will have to propagate before the clock goes high and the glue logic brings the WE\ of the SRAM low. This would allow me to run between 5 and 10MHz, depending on the combination. This is a lot less than the 14 MHz maximum the W65C02 I am using is rated for but it might work until I get the kinks out of working with the ATF1508.
Wednesday, November 1, 2017
6800 Family Tree
After about two years away from electronics, I have some time for my projects now that I am almost finished with my degree. Lately, I've been thinking about a lot of things and I want to start some new calculator projects and pick up where I left off on some other projects. One thing I have been thinking about is working on a simpler 6502 based calculator than I was working on before to get more experience working with the chip. I think the 6502 graphing calculator I was working on did not turn out very well because I rushed through everything too fast trying to get ready for Makevention in 2015.
Along those lines I recently got curious about other chips I could use to make a calculator. The last time I was curious about that I compared some of the microcontrollers I was interested in to see how fast they compute. One of my plans now is to do some other comparisons, which I think will be more useful. This time I want to compare processors like the 6502 as well, so I looked at other chips that would be fun to use. One group that I have been interested in for a while is the 6800 family like the 6809 and 6309. I could not find a family tree on the internet, so I started making one myself. When I was almost done, I found this tree in a document about the 6804:
Along those lines I recently got curious about other chips I could use to make a calculator. The last time I was curious about that I compared some of the microcontrollers I was interested in to see how fast they compute. One of my plans now is to do some other comparisons, which I think will be more useful. This time I want to compare processors like the 6502 as well, so I looked at other chips that would be fun to use. One group that I have been interested in for a while is the 6800 family like the 6809 and 6309. I could not find a family tree on the internet, so I started making one myself. When I was almost done, I found this tree in a document about the 6804: