Sunday, December 30, 2018

Tiny Calculator: Sine, Cosine, and Tangent

For the trig functions, I started building the same angle table that I used for the CORDIC routines in my RPN Scientific Calculator. Comparing my table to an example CORDIC calculation I found, I noticed that my values are all pi times bigger than the ones in the example. Could I have made the same mistake in my RPN Scientific Calculator? When I get back home from the holidays, I should check to make sure that this does not throw off any of the calculations. Another thing I found out from the same page is that the K value to multiply by after the CORDIC calculation depends on the particular number of iterations performed. In my RPN Scientific Calculator, I included a K value for the maximum number of iterations and truncated it when less precision was used. On second thought, I should have done some testing to make sure that truncating the K value to the precision of the argument always yields the best results for a particular level of precision. At the moment I don't plan to redo anything on my older calculators unless something is majorly wrong. It is good, though, to recognize this type of mistake so I can look out for it in future projects that use CORDIC.
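
The dependence of K on the iteration count is easy to see with a few lines of Python (just a sketch, not the calculator's actual code):

```python
import math

def cordic_gain(n):
    # Accumulated CORDIC gain after n rotation iterations;
    # the K value to multiply by afterward is its reciprocal.
    g = 1.0
    for i in range(n):
        g *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return g

# K settles toward ~0.60725 for large n, but for small iteration
# counts it is noticeably different, so a truncated copy of the
# maximum-iteration K is not the same as the K for fewer iterations.
for n in (4, 8, 16, 40):
    print(n, 1.0 / cordic_gain(n))
```

For example, K for 4 iterations is about 0.6124, while the first four digits of the full-precision K are 0.6072, so the two differ in the third decimal place.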

To generate the angle table I used the bc command line calculator. It has a very simple programming language that can run from a script. Interestingly, bc does not seem to do any type of rounding, so the tables I have are unrounded. Maybe I will experiment with a rounded table to see how much of an effect it has. For some of the other tables (including a couple I ended up not needing) I used Python, which has been really helpful. The table itself contains angles descending in size, so I only sto
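
To get a feel for how much the lack of rounding matters, here is a hypothetical Python version of the table generator (the real table came from a bc script; the function name and digit count are just for illustration):

```python
import math
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def angle_table(n, digits=10, truncate=True):
    # atan(2^-i) for i in 0..n-1, cut to a fixed number of decimals.
    # ROUND_DOWN mimics bc's truncation; ROUND_HALF_UP rounds instead.
    q = Decimal(1).scaleb(-digits)  # e.g. 1E-10
    mode = ROUND_DOWN if truncate else ROUND_HALF_UP
    return [Decimal(math.atan(2.0 ** -i)).quantize(q, rounding=mode)
            for i in range(n)]

# atan(1) = 0.78539816339..., so the two modes differ in the last digit
print(angle_table(1)[0])                  # truncated
print(angle_table(1, truncate=False)[0])  # rounded
```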

The first step to implementing the trig functions was replicating the CORDIC calculations in a spreadsheet. This works for about 30 iterations, until the numbers become too small for the spreadsheet to handle. Rather than doing bit shift operations like I would on the microcontroller, I multiplied by inverse powers of two, which accomplishes the same goal. When I had that working, I tried generating those multipliers on the MSP430 and multiplying, rather than bit shifting. I thought this might be fairly fast since the multiplier could be left in binary, which makes it quicker to use with the Russian Peasant algorithm, and the number of zeros in the multiplier would mean relatively few multiplies altogether. Consider this shortened table of successive inverse powers of two:

1.0000000000
0.5000000000
0.2500000000
0.1250000000
0.0625000000
0.0312500000
0.0156250000
0.0078125000
0.0039062500
0.0019531250
0.0009765625
0.0004882812
0.0002441406
0.0001220703
0.0000610351
0.0000305175
0.0000152587
0.0000076293
0.0000038146
0.0000019073
0.0000009536
0.0000004768
0.0000002384
0.0000001192
0.0000000596
0.0000000298
0.0000000149
0.0000000074
0.0000000037
0.0000000018
0.0000000009
0.0000000004
0.0000000002
0.0000000001

By starting multiplies with the first non-zero digit and ending with the last one, this method saves a lot of cycles by not looping needlessly through zeros. Overall, this routine takes at least 1.4 million cycles (I did not finish adding carries back in when I realized how many cycles it was taking). This is roughly ten times faster than the routine on my RPN Scientific Calculator, which takes about a second to calculate trig functions to this precision.
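
The multiply-instead-of-shift idea looks like this in Python (floating point stands in for the calculator's BCD arithmetic, so this is only an illustration of the algorithm, not the actual routine):

```python
import math

ITERATIONS = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
K = 1.0
for i in range(ITERATIONS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))  # K for this iteration count

def cordic_sin_cos(theta):
    # Rotation-mode CORDIC for |theta| <= pi/2.
    x, y, z = 1.0, 0.0, theta
    for i in range(ITERATIONS):
        m = 2.0 ** -i  # inverse power of two used as a multiplier
        if z >= 0.0:   # rotate toward the target angle
            x, y, z = x - y * m, y + x * m, z - ANGLES[i]
        else:
            x, y, z = x + y * m, y - x * m, z + ANGLES[i]
    return x * K, y * K  # (cos, sin)

c, s = cordic_sin_cos(0.5)
```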

The next thing I tried was shifting X and Y rather than using a multiplier. Each new iteration requires shifting X and Y by one extra bit, so the 110th and last iteration has to shift by 110 bits. To speed the shifting process up, I use a table to do as many 8-bit shifts as possible, then finish with 1-bit shifts. The table contains 4 bytes of packed BCD for each packed BCD byte from 0 to 99, since dividing a byte by 256 produces at most eight decimal digits (99/256 = 0.38671875, for example). It also contains padding for bytes ending in A through F in hexadecimal so that I can look up BCD bytes without converting them to binary first. With 10 rows, six padding entries per row, and four bytes per entry, 240 of the table's 640 bytes are wasted padding. Because each row has 24 contiguous padding bytes, I should be able to store strings or even short routines there and reclaim some space if I start running out of room. The routine itself finishes in about 1.1 million cycles, so it is definitely faster than using a multiplier. It is also 2.5k versus the 1.8k of the multiplier version, which brings the total flash usage to 4.1k (25% of the total). This is pretty compact, so I still think I will easily be able to fit everything onto the chip I need to. The routine would probably be faster if I added tables for shifts of other lengths like 12 or 4, but this is definitely fast enough already. Another improvement might be to figure out the maximum number of places a shifted number will take and avoid shifting less significant decimal places that get shifted out anyway.
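
A sketch of how such a table could be generated (assuming each entry holds the eight decimal digits of value/256, which is what an 8-bit right shift contributes; the function name and layout details here are mine, not the actual flash image):

```python
def build_shift_table():
    # 640-byte table: for each index byte 0x00-0x9F, four packed BCD
    # bytes holding the 8 decimal digits of (BCD value)/256. Indexes
    # with a low nibble of A-F are padding so a packed BCD byte can
    # be used as an index without converting it to binary first.
    table = bytearray()
    for index in range(0xA0):  # ten rows of 16 entries
        hi, lo = index >> 4, index & 0x0F
        if lo > 9:  # padding entry for non-BCD low nibbles
            table.extend(b'\x00\x00\x00\x00')
            continue
        value = hi * 10 + lo
        digits = value * 390625  # value/256 * 10**8, always exact
        entry = bytearray(4)
        for pos in range(7, -1, -1):  # pack 8 digits into 4 BCD bytes
            d = digits % 10
            digits //= 10
            if pos % 2:
                entry[pos // 2] |= d        # low nibble
            else:
                entry[pos // 2] |= d << 4   # high nibble
        table.extend(entry)
    return table
```

Indexing with the packed BCD byte 0x99 lands on the entry 0x38 0x67 0x18 0x75, the digits of 99/256 = 0.38671875.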

This is the first function I have allocated stack space for. In practice, I only really use this as storage space and do all calculations in registers. Keeping even a few words of stack space straight is a challenge, so I made some macros that assign stack space then generate a #define statement to associate a stack offset with a variable name. Here is an example:

BeginLocals 2 ;allocate two words on the stack
Local foo     ;assign foo to first word
Local bar     ;assign bar to second word
MOV #5,foo    ;move 5 to first word on stack
MOV #4,bar    ;move 4 to second word on stack
...
EndLocals     ;free words on stack

Like a lot of other things in IAR, macro support is absurdly broken. You cannot supply a register name as an argument, which I would like to do to get the macro to use one register as a frame pointer. Calculating register offsets based on the word count passed as an argument also does not work despite what the documentation shows, so I have to create an if statement for when 1 word is allocated, another for when 2 are allocated, another for when 3 are allocated, etc. The debug cursor also glitches sometimes when it gets to a macro call. It seems the documentation is wrong in several places, since I couldn't get the ELSEIF statement in the documentation to work until I changed it to ELIF. The IAR about window shows that it is version 7, but I seriously don't think this mess is ready to be called version 1.00 yet.

One really important thing I realized is the trouble that comes from mixing 1's and 2's complement numbers. Some of the MSP430 documentation notes that the DADD instruction can do BCD addition but there is no BCD subtraction instruction. Instead, you can do a binary add of 0x6666 to a BCD number then invert all the bits. This produces the 1's complement of the number and when you add it to a BCD number with DADD and the carry set, it subtracts that number. Here is an example:

;Subtract 1234 from 4000
MOV #0x4000,R4
MOV #0x1234,R5
ADD #0x6666,R5 ;R5 is now 0x789A
INV R5         ;R5 is now 0x8765
SETC
DADD R5,R4     ;R4 is now 0x2766 and carry is set

The 1's complement version (0x8765) is the same as 0x9999-0x1234, so in decimal terms it is really a 9's complement. Adding this to 0x4000 sets the carry bit, which lets you know the result is positive, and this is where the trouble starts. For any number that has to be saved, I store the carry output to show whether the number is negative or positive. It is also important since you have to set the carry when you add a negative 1's complement number to a positive number. The problem is that the result of this calculation is now a 2's complement number (a 10's complement, in decimal terms)! This means that instead of representing -1234 as 0x9999-0x1234 with the carry set, it has to be represented as 0x10000-0x1234 with no carry set.

This works fine once you understand the slight difference in representation, until you try to subtract zero from a number. The 1's complement representation of 0 would be 0x9999-0=0x9999, and if you added this to 5 with the carry set, you would still have 5 and the carry would be set, indicating a positive number (correct). The 2's complement, on the other hand, would be 0x10000-0, which is 0x0000 since that is all that fits in a 16-bit register. Adding 0 to 5 leaves 5 but does not set the carry! This would erroneously indicate that the answer is actually -9995 instead of 5. If I understand correctly, 2's complement in binary treats 0-0x7F as positive and 0x80-0xFF as negative, which seems analogous to including the sign bit I am storing separately in the number itself. I think I could accomplish the same thing in BCD by treating 0-0x49 as positive and 0x50-0x99 as negative. For the time being, I am converting 2's complement numbers back to 1's complement when I need to use them in subtraction, which wastes a lot of cycles. On the next calculator project I will try to do everything in 2's complement. Learning things like this is really valuable, since I will be able to get them right on bigger calculator projects in the future.
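
The zero problem is easy to reproduce in a quick Python model of DADD and the complement trick (single 16-bit words only, and the function names are my own, so this is just a sketch):

```python
def dadd(a, b, carry):
    # Model of the MSP430 DADD instruction on one 16-bit packed BCD
    # word: decimal add with carry in, returning (result, carry out).
    result = 0
    for shift in (0, 4, 8, 12):
        d = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry
        carry = 1 if d > 9 else 0
        if carry:
            d -= 10
        result |= d << shift
    return result, carry

def nines_complement(x):
    # The "1's complement" trick from the text: binary-add 0x6666,
    # then invert all the bits. Gives 0x9999 - x for valid BCD x.
    return ~((x + 0x6666) & 0xFFFF) & 0xFFFF

# 0x4000 - 0x1234: complement, set carry, DADD -> 0x2766, carry set
diff, pos = dadd(0x4000, nines_complement(0x1234), 1)

# Subtracting zero: the 9's complement of 0 is 0x9999 and behaves,
# but the 10's complement wraps to 0x0000 and the carry is lost
good, good_carry = dadd(0x0005, nines_complement(0), 1)   # 5, carry set
tens_zero, _ = dadd(nines_complement(0), 0, 1)            # 0x9999 + 1 -> 0x0000
bad, bad_carry = dadd(0x0005, tens_zero, 0)               # 5, carry clear!
```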

Writing assembly is still taking some getting used to. One thing I have noticed is that I don't use many subroutines. Maybe this is because the math routines I have worked on for this project were pretty short until I got to the trig calculations. Sometimes when I want to reuse a piece of code, I PUSH the return address before the code starts and put a RET at the end, so that it still works when execution reaches it but I can also jump to it from somewhere else:

MOV #6,R4
PUSH #Mult3Done
Mult3:
   MOV R4,R5 ;save copy of R4
   RLA R4    ;multiply input by 2
   ADD R5,R4 ;add input to total
RET
Mult3Done: 

Running this inline while also jumping to it from elsewhere is different from putting all my subroutines in functions somewhere else like I would in C. It is nice to be able to jump around within the code without having to pass long lists of arguments like I would in C. Functions inside functions are very useful, although not every C compiler I have used supports them. One of the problems of using routines in routines like this is that I use whatever register happens to be free at that point in the code, although it may not be free in another place when I want to jump to the subroutine. This is something I would have to consider even if I moved the code out to a separate routine, but it does make me think that a compiler might do a better job than me of figuring out which register to use to minimize saving and transferring registers to free them up. I also notice that I make a list of what every register is doing and whether it is free after every section of code. This is a little time-consuming but helps me keep everything straight. Other habits I noticed might be handled differently by a compiler. For example, these two pieces of code accomplish the same thing:

;Human readable:
SUB #16,R4   ;Reset pointer to end of 16 byte array 
             ;in R4 back to beginning of array
MOV #0,4(R4) ;Write 0 to 5th byte of the array

;Faster, smaller, less readable:
MOV #0,-12(R4) ;Write 0 to 5th byte of the array
               ;since R4 points to end of array

The next step is to go in the other direction and implement inverse trig functions then logarithms. It looks like I will have plenty of space for it.
