Monday, July 29, 2013

RPN Scientific Calculator: More CORDIC Routines

After finishing the CORDIC routine for logarithms I started on the CORDIC routine for trig functions. For this I needed a different look up table and a different function, but it didn't take up much room in the flash. The table for pre-computed logarithms is stored compactly in flash and the same routine that unpacks it into the external memory is also used to unpack the trig table. With this I can calculate sine, cosine, tangent, and arctangent. It should also be possible to calculate arcsine as well but I haven't tried yet.

To test these new function I tried calculating atan(tan(37)). When using 16 decimal places for the calculations this gives the answer 36.9999999999999781, which is pretty accurate. Unfortunately, this took over two seconds, which is just too slow. The logarithm and powers of e functions were at least twice as fast. In order to speed things up I tried to eliminate as many reads and writes to the external memory as I could. One thing I didn't notice when I wrote the original functions is that the loop condition for a for loop is recalculated each iteration. When the condition depends on a variable in external memory, the variable is fetched on every iteration, which is wasteful if the variable hasn't changed. After changing the code to use local copies of external variables whenever possible, I was able to cut down external memory access by almost 25%.

25% wasn't quite enough so I looked at the code I was using to right shift. This allows me to divide numbers by 2 quickly but takes some extra coding since shifting in BCD is a little more complicated than a normal binary shift. Because of the extra work involved in keeping everything straight during BCD shifting, I was doing shifts one bit at a time. This means a 58 bit shift, for example, would simply do 58 shifts of 1 bit each. As you can imagine this is not efficient and was a big part of the slow down in the code. My solution was to use as many 4-bit shifts as needed, then a few 1 bit shifts at the end to finish it off.

Shifting by 4 bits in BCD is fairly easy. As you probably know, shifting by 4 bits is the same as dividing by 16, and dividing by a number (16 in this case) is the same as summing the results of dividing each decimal place by that number. Since there are only 10 possibilities (0-9) for each decimal place, I stored the results of dividing each possibility by 16 in a look up table and used that to sum the results. Here is an example:

789 ÷ 16 = 700/16 =   4375
                  80/16 =     5000
             +     9/16 =      5625
                               49.3125

The values 4375, 5000, 5625, etc are stored in the look up table. The sum of these results are stored in local RAM since 6 bytes is enough to keep track of the summing long enough to deal with any carries before finally writing the last byte to the external RAM. Doing it like this I only need to read and write every decimal place once per 4-bit shift. This cuts down the number of external memory accesses down to almost one third of what it was at the beginning and allowed me to calculate atan(tan(37)) in a little over a second. All of this fits in 8.1K of Flash meaning I just broke the halfway mark of what the MSP430G2553 chip I am using can hold. If there is any space left over at the end, I would like to implement a similar 8-bit shift routine as well to speed things up even more.

Wednesday, July 10, 2013

RPN Scientific Calculator: Division and Logarithms

After rewriting the multiplication function I started on the division routine. There are several ways to do this. I used the subtract and shift method. It takes up a little more than 150 lines of code which makes it about twice as big as the multiplication routine. Unlike with the other operations, with division a maximum number of decimal places has to be specified. It seems unlikely that I would ever need more than 16 places, which makes me think a fixed place format might have worked too. This is still possible if I need to shrink the program size.

The next step is to implement logarithms and powers of e. This will be useful on its own but will also let me calculate powers and roots. One way I found to calculate logarithms is with a Taylor series like this: 

ln[(1+x)/(1-x)]=2(x+x3/3+x5/5+x7/7+....)

This seems ideal since it only requires one loop and several multiply and divide operations which shouldn't take up much room in Flash. Even with the the MSP430 running at 16MHz, it rook over 3 seconds to calculate ln(7). In an effort to speed things up I examined the code for multiplying and dividing but didn't find much to improve. One thing I did change was the addition routine. I was adding a 0 to the beginning of every result so that there would be room for a 1 if I needed to carry. However, in most cases the 0 wasn't needed and when doing repeated additions during multiplication, the 0s accumulated at the beginning of the result wasting memory. Removing these 0s wasted time so I changed the addition code to only add the 0 when necessary. This still didn't make it much faster, though.

Next, I ran the code on my PC and noticed that it needed to access the external memory over 2 million times to calculate the logarithm to 16 decimal places. Clearly this was the bottleneck. The next idea I tried was the CORDIC method which is ideal for small microcontrollers like the MSP430. This uses a precomputed table and only needs adds and bit shifts, two things that microcontrollers can do much more efficiently than multiplying and dividing. Using this method I can calculate e^x and ln(x) for a wide range of values in less than one second because it only needs about 50,000 accesses to the external memory.

After I finished the ln function but before I started on the e^x function, I noticed that the program was already 6.5k. This is a little concerning since 16k is the max size for the program and there are still many more features to add. One thing that can reduce size is the static keyword. Making all my functions static shrunk the program size down to only 3.7k. After adding some more entries to the CORDIC table, improving the ln function, and adding the e^x function, the program size was back up to almost 6.5k again. This seems very strange since the program is now almost 1,000 lines and the ln functions only takes 50 lines. In fact commenting out the ln function reduces the program down to only 4.4k. Why this one small function should suddenly add more than 2k to my code is not clear but I hope to figure it out soon.