After finishing the CORDIC routine for logarithms I started on the CORDIC routine for trig functions. For this I needed a different look up table and a different function, but it didn't take up much room in the flash. The table for pre-computed logarithms is stored compactly in flash and the same routine that unpacks it into the external memory is also used to unpack the trig table. With this I can calculate sine, cosine, tangent, and arctangent. It should also be possible to calculate arcsine as well but I haven't tried yet.
To test these new function I tried calculating atan(tan(37)). When using 16 decimal places for the calculations this gives the answer 36.9999999999999781, which is pretty accurate. Unfortunately, this took over two seconds, which is just too slow. The logarithm and powers of e functions were at least twice as fast. In order to speed things up I tried to eliminate as many reads and writes to the external memory as I could. One thing I didn't notice when I wrote the original functions is that the loop condition for a for loop is recalculated each iteration. When the condition depends on a variable in external memory, the variable is fetched on every iteration, which is wasteful if the variable hasn't changed. After changing the code to use local copies of external variables whenever possible, I was able to cut down external memory access by almost 25%.
25% wasn't quite enough so I looked at the code I was using to right shift. This allows me to divide numbers by 2 quickly but takes some extra coding since shifting in BCD is a little more complicated than a normal binary shift. Because of the extra work involved in keeping everything straight during BCD shifting, I was doing shifts one bit at a time. This means a 58 bit shift, for example, would simply do 58 shifts of 1 bit each. As you can imagine this is not efficient and was a big part of the slow down in the code. My solution was to use as many 4-bit shifts as needed, then a few 1 bit shifts at the end to finish it off.
Shifting by 4 bits in BCD is fairly easy. As you probably know, shifting by 4 bits is the same as dividing by 16, and dividing by a number (16 in this case) is the same as summing the results of dividing each decimal place by that number. Since there are only 10 possibilities (0-9) for each decimal place, I stored the results of dividing each possibility by 16 in a look up table and used that to sum the results. Here is an example:
789 ÷ 16 = 700/16 = 4375
80/16 = 5000
+ 9/16 = 5625
49.3125
The values 4375, 5000, 5625, etc are stored in the look up table. The sum of these results are stored in local RAM since 6 bytes is enough to keep track of the summing long enough to deal with any carries before finally writing the last byte to the external RAM. Doing it like this I only need to read and write every decimal place once per 4-bit shift. This cuts down the number of external memory accesses down to almost one third of what it was at the beginning and allowed me to calculate atan(tan(37)) in a little over a second. All of this fits in 8.1K of Flash meaning I just broke the halfway mark of what the MSP430G2553 chip I am using can hold. If there is any space left over at the end, I would like to implement a similar 8-bit shift routine as well to speed things up even more.
To test these new function I tried calculating atan(tan(37)). When using 16 decimal places for the calculations this gives the answer 36.9999999999999781, which is pretty accurate. Unfortunately, this took over two seconds, which is just too slow. The logarithm and powers of e functions were at least twice as fast. In order to speed things up I tried to eliminate as many reads and writes to the external memory as I could. One thing I didn't notice when I wrote the original functions is that the loop condition for a for loop is recalculated each iteration. When the condition depends on a variable in external memory, the variable is fetched on every iteration, which is wasteful if the variable hasn't changed. After changing the code to use local copies of external variables whenever possible, I was able to cut down external memory access by almost 25%.
25% wasn't quite enough so I looked at the code I was using to right shift. This allows me to divide numbers by 2 quickly but takes some extra coding since shifting in BCD is a little more complicated than a normal binary shift. Because of the extra work involved in keeping everything straight during BCD shifting, I was doing shifts one bit at a time. This means a 58 bit shift, for example, would simply do 58 shifts of 1 bit each. As you can imagine this is not efficient and was a big part of the slow down in the code. My solution was to use as many 4-bit shifts as needed, then a few 1 bit shifts at the end to finish it off.
Shifting by 4 bits in BCD is fairly easy. As you probably know, shifting by 4 bits is the same as dividing by 16, and dividing by a number (16 in this case) is the same as summing the results of dividing each decimal place by that number. Since there are only 10 possibilities (0-9) for each decimal place, I stored the results of dividing each possibility by 16 in a look up table and used that to sum the results. Here is an example:
789 ÷ 16 = 700/16 = 4375
80/16 = 5000
+ 9/16 = 5625
49.3125
The values 4375, 5000, 5625, etc are stored in the look up table. The sum of these results are stored in local RAM since 6 bytes is enough to keep track of the summing long enough to deal with any carries before finally writing the last byte to the external RAM. Doing it like this I only need to read and write every decimal place once per 4-bit shift. This cuts down the number of external memory accesses down to almost one third of what it was at the beginning and allowed me to calculate atan(tan(37)) in a little over a second. All of this fits in 8.1K of Flash meaning I just broke the halfway mark of what the MSP430G2553 chip I am using can hold. If there is any space left over at the end, I would like to implement a similar 8-bit shift routine as well to speed things up even more.