Wednesday, November 15, 2017

6502 BCD Multiplication

In preparation for the 6502-based calculator I am planning on building, I started thinking about the best way to do BCD multiplication. One of the reasons I was interested in working with the 6502 is its built in BCD mode. This works great for add and subtract but the chip does not have any sort of multiply or divide, BCD or otherwise. (Some of the 6800 family I was researching before do have hardware multiply, as does the 8051.) There are several routines that can make BCD multiplication faster, so I tried some of them on the 6502 to see which one works best.

To test these I used the Kowalski 6502 Simulator, which is neat because you can assemble in the simulator and then single step or run the program and see which line is being executed. Another convenient thing is that an input/output window is mapped into memory starting at at $E000. I also like the help window which automatically gives you information on op codes when you type them. On the other hand, the included macro system is extremely primitive and you can't do much with it, especially compared to the excellent macro system in CA65. It has a lot of other shortcomings and is fairly buggy, but I don't want to complain since it is free software. After I got frustrated at the lack of basic macro functionality, I tried assembling with CA65 and loading the binary into the simulator. The diassembler seems to work well but of course there is no way for the simulator to know label or function names. For these multiplication tests I decided to just make do with the simulator since I don't need much macro functionality. In the future, though, I will go back to using the 6502 trainer I made before.

To run the tests I made a loop that multiplies all combinations of BCD numbers $00 x $00 to $99 x $99. The Kowalski simulator has a function to count cycles, so I used that to find the time of an empty loop. For each test I recorded the cycles it took, subtracted the loop overhead time, and divided by 10,000 to get average cycles per calculation. After I was done, I went back and used look up tables for operations like shifting by 4 and halving numbers to speed things up a little. I also tried checking for 0 and 1 as arguments to skip the rest of the calculation, but this was not faster. In the table below, the table size column takes into account all the space the tables take up, including unused padding bytes inside the table. Here are the results in the order I did the tests:

MethodCycle averageCode sizeTable size
1Repeated additions1,596.23340
2Repeated additions, minimize loops1,236.66480
3Shift/add284.80670
44x8 lookup table101.006510k
54x8 lookup table, byte aligned97.006114.8k
6Russian peasant algorithm267.79660
7Quarter square table (half table)115.3385816
8Quarter square table (full table)118.00781,632
9Quarter square table (BCD to bin to BCD)86.00511,816
10Quarter square table (BCD to bin only)60.0034792
11Half squares113.60126200
12Simulated ROM lookup31.002046k

Descriptions of methods below:
1. Repeated additions
This is the easiest and most straightforward method. Just add one number to the final total over and over while counting down the other value.

2. Repeated additions, minimize loops
This is similar to the first method but with the smaller number as the loop counter. For example, for 1 x 99 the 1 becomes the counter rather than 99, so that the calculation becomes 99 added once, rather than 1 added 99 times.

3. Shift/add
This is a more standard way of doing multiplication that repeatedly adds one argument and uses each of the 4 bit places of the second argument to count down.

4. 4x8 bit lookup table
This way uses look up tables for one argument times the ones place of the other argument and another look up table for one argument times the 10s place. This is one of the faster ways to multiply but it requires a 10k table. In general, I hope to have plenty of room for look up tables in my calculator but faster ways also use far smaller tables.

5. 4x8 bit lookup table, byte aligned
This is similar to the above method but with the second table aligned to an $xx00 address, it eliminates the need to add the low byte of the table's address to the argument.

6. Russian peasant algorithm
This method doubles one argument while halving the other and adds the doubled value to the final total if the halved value is odd. It did not turn out to be especially fast on average

7. Quarter square table (half table)
This is one of the fastest ways to multiply. It relies on this equation:


When I learned about this, I looked the algorithm up on Wikipedia and made a version from that description. One look up table holds all the values for x^2/4 up to x = $198. This is convenient because all of those values fit into 2 bytes. The same table can be used for both a + b and a - b, but other implementations I found online use a second table for a - b.

8. Quarter square table (full table)
This works the same as the method above but it adds a second table to speed the second half of the calculation up. This is based on a very efficient way I found on the codebase64 website.

9. Quarter square table (BCD to bin to BCD)
After I finished testing the first 8 methods, I posted on the 6502 forum and asked for suggestions. One thing I figured out from that discussion is that I could convert the arguments to binary with a look up table to index into the table with the built in index modes. This is faster than using the original BCD arguments to find the table look up, since that requires adding them to the table address manually to get the offset. It is not possible to use the BCD arguments themselves for the built in address indexing since there is no way to tell the chip to add them to the address internally in BCD mode. I used the binary offset to load decimal values from the look up table, but then I had to convert those to BCD with a different lookup table.

10. Quarter square table (BCD to bin only)
After a while I realized that I could make method 9 faster by eliminating the binary to BCD conversion by storing BCD values in the table even though I use binary addressing to retrieve the value. This turned out to be the fastest method by far. 

11. Half squares
This relies on the same equation as the quarter squares method but moves the 4 inside the brackets of the square:


This requires you to divide a + b and a - b by two, but it lets you use one small table. One of the users on the 6502 forum recommended this method and claims it only needs 75 or cycles for BCD, but I could not get the code he gave to work. The problem is that a + b needs to be even. If it is odd, you need to add or subtract one from one argument and add the other argument to the total at the end. Testing for these special cases and correcting the final sum makes this routine relatively slow.

12. Simulated ROM lookup
The absolute fastest way I can think to multiply is storing the table in an external ROM and memory mapping buffers to drive the ROM. This way came out to 32 cycles, but I shrunk it to 31 by using address indexing.

No comments:

Post a Comment