What is Fixed-Point Math?

Fixed-point math is math in which the decimal doesn’t move relative to the digits composing the number. Some call this ‘integer math’, which is a reference to the integer form that numbers normally take on the processor in fixed-point notations. Others call this ‘q math’ for the common q notation that this method has become known by – more on this later.

Why would you want to Use Fixed-Point Math?

There is one overriding reason to use fixed-point math: speed. Depending on the processor, floating-point math can take many times more cycles to execute than fixed-point math. The key is to know your processor’s capabilities. If you have a floating-point unit (FPU) on the processor, then the speed comparison of fixed-point vs. floating-point will be much more closely aligned. The list of processors without FPUs is long, so some general guidelines:

  • 8-bit and 16-bit processors generally don’t have floating-point units. This is not a limitation of technology, just the way things tend to shake out when manufacturers are designing their parts
  • Microchip’s PICxx – including dsPIC – don’t have a floating point unit until you get to the PIC32 series
  • Atmel’s ATTINY and ATMEGA don’t have them
  • STMicro’s STM32Fx series doesn’t include them on the Cortex-M0 and Cortex-M3 parts (F0, F1, F2); the Cortex-M4-based parts (F3, F4) and later do
  • TI’s MSP430 series generally don’t have FPUs.

You must consult the datasheet to be certain!

There is an additional reason to use fixed-point math instead of floating-point math even when an FPU is available: fixed-point math, when implemented with saturating routines like the ones in this article, saturates much like real-world hardware does. For instance, PID loops written in fixed-point math have a maximum value they can represent. In the real world, PWM has a maximum duty cycle (100%) that it can apply. This works out very well. A floating-point implementation of a PID loop often has a statement that is something like (pseudocode) ‘if output is greater than 100%, set output equal to 100%’. Saturating fixed-point implementations reach their limit at the ends of the representable range, emulating the hardware’s inherent limitations.
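To make the contrast concrete, here is a minimal sketch in C. The names `clamp_duty_percent` and `saturate_to_int16` are illustrative assumptions, not taken from any particular library, and the 0% lower bound is added for completeness. The first function is the explicit limit a floating-point loop typically carries; the second narrows a wide intermediate result to the 16-bit signed range, which is where a fixed-point routine's saturation comes from.

```c
#include <stdint.h>

/* Floating-point style: the output limit is an explicit extra step.
 * The 0% lower bound is an assumption added for illustration. */
float clamp_duty_percent(float output)
{
    if (output > 100.0f) return 100.0f;
    if (output < 0.0f)   return 0.0f;
    return output;
}

/* Fixed-point style: a wide intermediate result is narrowed to the
 * 16-bit signed range, so the limit is part of the arithmetic itself. */
int16_t saturate_to_int16(int32_t wide)
{
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return (int16_t)wide;
}
```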

Usually, the decision to use fixed-point notation is a combination of processor capability, speed requirement, and execution frequency. For instance, if you are executing a temperature PID loop, this loop probably executes at a maximum of 1Hz. Even a long series of calculations will take a small percentage of the processor’s time. On the other hand, a current control algorithm for motor phase currents might execute 50000 times per second. In this case, you probably want to go the fixed-point route.

Why wouldn’t you want to use Fixed-Point Math?

Fixed-point math has one limitation that can be severe: loss of resolution. In most cases, the loss of resolution involved is trivial or can be mitigated. In precision applications – such as measurement of a voltage across a strain gauge arranged across a Wheatstone bridge – the resolution loss would likely be intolerable and floating-point math should probably be utilized. Additionally, if you have an on-die FPU, your first draft should probably be floating point. Despite the loss in precision, one may utilize fixed-point math and – a majority of the time – have results that are good enough for the application.

Fixed-Point Concepts

Fixed-Point Notation and Basic Operations

I will not go into great depth when talking about fixed-point notation, but I would be remiss if I didn’t at least mention it. Before writing the first line of code, you should know how many digits you have and where the decimal is. For a more in-depth treatment of fixed-point (Q) notation, check out the Wikipedia article.
The short version: start with the capital letter ‘Q’. Easy enough. Next, the number of digits to the left of the decimal. Next, the decimal. Then the number of digits to the right of the decimal. Generally, the uppermost digit is considered a sign bit. As the number of digits to the right of the decimal increases, the precision of the format increases.

A Q16.16 number has twice the ‘resolution’ of a Q1.15 number, because the part to the right of the decimal has one more bit. It also has 32768 times the range (the upper bit in each is a sign bit). A Q8.8 has a larger range than a Q1.15 number, but has a much more limited resolution. You may find that your application does not have enough resolution and/or range in one fixed-point format, but does in another. I have found that Q1.15 is one of the best first-draft formats on 16-bit devices. If you are rocking a 32-bit device, Q16.16 or Q1.31 might be more appropriate.
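These comparisons follow from two formulas: a Qm.n format has a step size of 2^-n and, when the m integer bits include the sign bit as they do in this article, spans -2^(m-1) up to 2^(m-1) - 2^-n. The sketch below just evaluates those formulas. The function names are made up for illustration, and it uses double arithmetic, so it is meant for a desktop scratch file rather than the target.

```c
#include <stdint.h>

/* Smallest step of a Qm.n format: 2^-n (frac_bits = n, up to 63). */
double q_resolution(unsigned frac_bits)
{
    return 1.0 / (double)((uint64_t)1 << frac_bits);
}

/* Most negative value of a signed Qm.n format whose m integer bits
 * include the sign bit: -2^(m-1). int_bits must be at least 1. */
double q_min(unsigned int_bits)
{
    return -(double)((uint64_t)1 << (int_bits - 1));
}

/* Most positive value: 2^(m-1) minus one step. */
double q_max(unsigned int_bits, unsigned frac_bits)
{
    return -q_min(int_bits) - q_resolution(frac_bits);
}
```

For example, `q_resolution(15) / q_resolution(16)` is 2 and `q_min(16) / q_min(1)` is 32768, which are the two ratios quoted for Q16.16 versus Q1.15.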

One more quick note regarding notation… notations that have a ‘1’ next to the ‘Q’ are also commonly referred to as ‘fractional’ since they can only represent numbers between -1.0 and +0.999. This appears to be a severe limitation but – as we will see – this notation often emulates the way we think about processes (think max speed) and – even better – the way things often work in the real world.

One of the most common formats to use on 16-bit processors is Q1.15, which we will use for the remainder of this article. Note that the number of digits adds up to ‘16’, which is a nice convenient value that is well-defined in C with ‘int16_t’. Note also that there is one digit to the left of the decimal and that this digit is a sign bit. A 16-bit signed value in integer math would have a range of -32768 to +32767. In Q1.15 math, this same range is mapped to -1.0 to +0.9999. Note that ‘Q1.15’ is often shortened to ‘Q15’.
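Since the mapping is just a scale by 2^15 = 32768, converting between a real number and its Q1.15 integer is short in either direction. The helpers below are a sketch with hypothetical names (`q15_from_double`, `q15_to_double`). They use floating point, so they are intended for a PC-side test bench or for working out constants, not for the inner loop on an FPU-less target.

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Real number -> Q1.15 integer, clamped to the representable range. */
q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;

    if (scaled >= 32767.0)  return 32767;   /* +1.0 and above clamp here */
    if (scaled <= -32768.0) return -32768;
    return (q15_t)scaled;                   /* truncates toward zero */
}

/* Q1.15 integer -> real number, handy when reading values in a debugger. */
double q15_to_double(q15_t q)
{
    return (double)q / 32768.0;
}
```

For example, 0.5 maps to 16384 and 8192 maps back to 0.25, while 1.0 is not representable and is clamped to 32767.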

For illustration, we should examine the behavior of Q1.15 math just to ensure that some of the implications are clear and to assist in debugging later. You – the user – will never see a value of 0.5 or any other decimal-like notation in the registers of your device. You will only see the integer representation of that value. Therefore, you need to know the basic logic. Presented here are some comparisons of generic operations to get you thinking.

| Floating-point Operation | Q1.15 Equivalent | Remarks |
|---|---|---|
| 0.5 × 0.5 = 0.25 | 16384 × 16384 = 8192 | May take some getting used to, but makes perfect sense… |
| 0.25 / 0.50 = 0.50 | 8192 / 16384 = 16384 | …again, makes sense… |
| 0.5 + 0.25 = 0.75 | 16384 + 8192 = 24576 | …on a roll… |
| 0.75 + 0.5 = 1.25 | 24576 + 16384 = 32767 | …wait, what? Remember, the largest value that can be represented in Q1.15 is 32767, which is roughly equivalent to 0.9999. This demonstrates the saturating nature of Q math. |
| -0.75 - 0.5 = -1.25 | -24576 - 16384 = -32768 | Again, the value saturates negatively the same as it does positively. The only difference is that we can represent -1.0. |
| 0.5 / 0.25 = 2.0 | 16384 / 8192 = undefined | Remember, the largest value we can represent is 0.9999, so we cannot represent 2.0. Technically this value is undefined, but it always makes sense to me to implement as a saturated value. |

Q1.15 Implementations

Manufacturers will sometimes include ‘.a’ files with headers with their embedded compilers. Often, these libraries are good and work adequately, but do not always work consistently across implementations. In other words, Microchip’s libraries won’t necessarily work the same around the edge cases nor use the same names as Atmel’s libraries.

Understand that there are fixed-point libraries out there – many of them higher precision. Additionally, there are more functions than those below – such as trigonometric functions. I am covering the basic addition/multiplication/division to illustrate the concepts. Once you get the basic functions, you should be able to apply the concepts to any fixed-point library. A quick GitHub search reveals several libraries, but most are geared towards Q16.16 formatted numbers.

These basic routines should work on all implementations, but – to be sure – they are not the most efficient implementation on most processors! For instance, in the dsPIC series, the on-chip DSP has a mode which operates on Q1.15 values natively. Not to mention that floating-point operations have been highly optimized on most compilers and operate very well. As a result, the below ‘pure-C’ functions are not likely to outperform a hand-tuned assembly function in most cases… but they can be illustrative of the concepts. Take a look at the GitHub page I created for this project for a more detailed implementation.

Type Definitions

From an implementation standpoint, it doesn’t matter whether you create a type definition or not. This just makes your code more readable.

#include <stdint.h>

typedef int16_t q15_t;

Addition/Subtraction

In this section of code, note the manual saturation that occurs.  This is a 'correct' implementation, but widening to 32 bits and clamping with comparisons is not the most efficient method on almost any platform.  Hand-tuned assembly can handle this in just a few cycles.

q15_t q15_add(q15_t addend, q15_t adder){
    /* widen to signed 32 bits so the sum cannot overflow */
    int32_t result = (int32_t)addend + (int32_t)adder;

    /* saturate to the Q1.15 range */
    if(result > 32767)          result = 32767;
    else if(result < -32768)    result = -32768;

    return (q15_t)result;
}
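
Subtraction follows the same pattern. A minimal sketch, assuming the name `q15_sub` and the same `q15_t` typedef (neither the name nor the listing comes from a vendor library):

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Saturating Q1.15 subtraction: minuend - subtrahend. */
q15_t q15_sub(q15_t minuend, q15_t subtrahend)
{
    /* widen to signed 32 bits so the difference cannot overflow */
    int32_t result = (int32_t)minuend - (int32_t)subtrahend;

    /* saturate to the Q1.15 range */
    if (result > 32767)       result = 32767;
    else if (result < -32768) result = -32768;

    return (q15_t)result;
}
```

With this, the -0.75 - 0.5 row of the table becomes `q15_sub(-24576, 16384)`, which returns -32768.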

Multiplication

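The multiplication routine is surprisingly simple.  Because all inputs are in the range of -1.0 to +0.99997, the product almost always fits.  The one exception is -1.0 × -1.0 = +1.0, which is one count past the largest representable value, so a single saturation check is all that is needed.

q15_t q15_mul(q15_t multiplicand, q15_t multiplier){
    /* Q1.15 x Q1.15 = Q2.30; shift right by 15 to get back to Q1.15 */
    int32_t product = ((int32_t)multiplicand * (int32_t)multiplier) >> 15;

    /* -1.0 x -1.0 gives 32768, the only result that needs saturating */
    if(product > 32767)    product = 32767;
    return (q15_t)product;
}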
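
The shift in the routine above truncates the low 15 bits of the product. If that small downward bias matters in your application, a common variant adds half an LSB before shifting. This is a sketch, not part of the original routine: the name `q15_mul_round` is an assumption, and it relies on `>>` being an arithmetic shift for negative values, which is implementation-defined in C but is what mainstream embedded compilers do. It also saturates the single overflow case, -1.0 × -1.0.

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Q1.15 multiply with round-to-nearest instead of truncation. */
q15_t q15_mul_round(q15_t multiplicand, q15_t multiplier)
{
    /* Q1.15 x Q1.15 = Q2.30 */
    int32_t product = (int32_t)multiplicand * (int32_t)multiplier;

    /* add half of one output LSB (2^14), then drop the low 15 bits */
    product = (product + 16384) >> 15;

    /* -1.0 x -1.0 lands on 32768, one past the Q1.15 maximum */
    if (product > 32767) product = 32767;

    return (q15_t)product;
}
```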

Division

Unfortunately, division presents us with a couple of edge cases.  These aren't difficult to deal with, but they do eat cycles.

The most obvious is that the divisor must not be '0'.  As you learned in math class, this is either infinity or undefined (depending on who is teaching, from what I recall).

A more subtle - but just as important! - aspect is that the magnitude (some of you say 'absolute value') of the dividend must be smaller than the magnitude of the divisor.  Go back to the table if you don't see this.

In either case, we will simply saturate in the appropriate direction by XORing the most significant bit in each and deciding - based on that - what the saturated value should be.

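/* magnitude of a Q1.15 value, widened to 32 bits so that -32768
 * can be represented (helper not shown in the original listing) */
static int32_t q15_abs(q15_t value){
    return (value < 0) ? -(int32_t)value : (int32_t)value;
}

q15_t q15_div(q15_t dividend, q15_t divisor){
    q15_t quotient;

    /* check to ensure dividend is smaller in magnitude
     * than the divisor; equal magnitudes would give +/-1.0,
     * so they are treated as saturation as well */
    if((divisor == 0) || (q15_abs(divisor) <= q15_abs(dividend))){
        /* saturation: if signs are different,
         * then saturate negative */
        if((divisor & 0x8000) ^ (dividend & 0x8000)){
            quotient = -32768;
        }else{
            quotient = 32767;
        }
    }else{
        /* scale the dividend by 2^15 in 32 bits, then divide */
        quotient = (q15_t)(((int32_t)dividend * 32768) / divisor);
    }

    return quotient;
}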

Using Q1.15

With Q1.15 math, it is useful to begin thinking of variables as percentages of maximum rather than absolute values. For instance, you might say that the voltage of an A/D is 1.65V of a 3.3V max or you could say that it is at 50%. Another example, you can say that the speed of a motor is at 2700RPM of a maximum of 3000RPM or you could say that the motor speed is at 90%. I try to implement Q1.15 conversions at all of the ‘edges’ of my program while all of the internals work natively in Q1.15. This contributes to code reuse and – in many cases – the code auto-scales when I change peripheral resolution.  A couple of useful examples might be in order…

Scaling a Voltage

You have a 10-bit A/D converter. Again, you care less about the particular number than about the percentage of maximum that the A/D is reading. We know that our maximum value is 1023, but we have to keep the divisor larger than the dividend in Q1.15 math, so we will use 1024 as the divisor.

#define AD_DIVISOR 1024  
q15_t scaledAD = q15_div((q15_t)rawADValue, (q15_t)AD_DIVISOR);

Now the ‘scaledAD’ variable contains the value of the A/D converter as a fixed-point value. It is a good idea to scale ALL parameters to Q1.15 notation if ANY of them are to be in Q1.15 calculations. This ensures that all variables are on the same scaling at all times. I don’t even keep ‘native’ values in memory anymore. I simply calculate the fixed-point equivalent and save that value instead.
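Because the divisor here is a power of two, the division reduces to a shift: 32768 × raw / 1024 = raw × 32. A sketch of that shortcut, assuming a right-justified unsigned 10-bit reading (the function name `adc10_to_q15` is hypothetical):

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Convert a right-justified 10-bit A/D count (0..1023) to Q1.15.
 * raw / 1024 in Q1.15 is raw * 32768 / 1024 = raw * 32 = raw << 5. */
q15_t adc10_to_q15(uint16_t raw)
{
    raw &= 0x03FFu;             /* keep only the 10 valid bits */
    return (q15_t)(raw << 5);   /* max: 1023 << 5 = 32736, fits in Q1.15 */
}
```

A mid-scale reading of 512 becomes 16384 (50%), the same result the division by 1024 produces.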

Scaling a PWM Value

So you have your A/D reading in memory as a Q1.15 value and now you need to scale that to a duty cycle register within a PWM module. PWM modules usually have a dedicated timer and those timers have period registers associated with them. One quick and easy way to scale your desired PWM value to your period register is to use the multiply routine:

int16_t dutyCycleRegister = (int16_t)q15_mul(scaledAD, periodRegister);

Now, you can change your PWM frequency at will using the period register and the output duty cycle will always scale appropriately (within resolution, of course… if your period register is ‘2’ then you will still get very bad resolution no matter how precise the intermediate maths).
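One thing to watch: `q15_mul` takes two `q15_t` arguments, so passing the period register directly only works while the period is 32767 or less. For a full 16-bit period register, the same scaling can be written with a wider intermediate. A sketch, assuming an unsigned 16-bit period and a non-negative duty request (the name `q15_to_duty` is made up for illustration):

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Scale a Q1.15 duty request (0 .. 32767) to a timer count (0 .. period). */
uint16_t q15_to_duty(q15_t duty, uint16_t period)
{
    if (duty < 0) duty = 0;     /* negative duty is treated as 0% (assumption) */

    /* 32767 * 65535 fits in 32 bits; shift back down from Q1.15 */
    return (uint16_t)(((uint32_t)duty * period) >> 15);
}
```

With a period of 1000 counts, a 50% request (16384) gives a duty cycle register value of 500.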

Scaling a Speed

How would you go about calculating the motor speed in Q1.15? This is more complicated than the previous examples, but is easier than you might expect. First, you should know the maximum speed of the motor. In this case, 3000RPM. You should be able to calculate the motor period from this, which is

1min/3000rotations × 60s/1min = 20ms/rotation

Now we know that 20ms/rotation is the rotational period at the maximum speed of the motor, which makes it the shortest period we expect to measure. We can use a timer to measure the motor period. This will be different in each implementation, but for the sake of explanation, let’s say that a timer value of ‘500’ corresponds to 20ms/rotation. Use a define or constant in your code:

#define MIN_ROTATIONAL_PERIOD 500

Then you can use the measured motor period to calculate the speed:

q15_t rotationalSpeed = q15_div((q15_t)MIN_ROTATIONAL_PERIOD, (q15_t)measuredPeriod);

Let’s say that the measuredPeriod was ‘1000’:

q15_div(500, 1000) = 16384

Remember, ‘16384’ corresponds to 50%, so now we know the speed as a percentage of maximum speed!
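Put together, the speed calculation might look like the sketch below. It is self-contained for illustration, so it does the divide inline rather than calling `q15_div`. The function name `speed_from_period` and the choice to report a zero or too-short period as full speed are assumptions.

```c
#include <stdint.h>

typedef int16_t q15_t;

#define MIN_ROTATIONAL_PERIOD 500u  /* timer counts per rotation at maximum speed */

/* Speed as a Q1.15 fraction of maximum, from a measured period in timer counts. */
q15_t speed_from_period(uint16_t measuredPeriod)
{
    /* at or above maximum speed (or a zero reading): saturate to ~100% */
    if (measuredPeriod <= MIN_ROTATIONAL_PERIOD) return 32767;

    /* speed = minimum period / measured period, scaled by 2^15 */
    return (q15_t)((MIN_ROTATIONAL_PERIOD * 32768ul) / measuredPeriod);
}
```

A measured period of 1000 counts returns 16384, that is, 50% of maximum speed, and a period of 2000 counts returns 8192 (25%).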

Summary

Proper utilization of fixed-point math in embedded applications can encourage code reuse, application flexibility, and provide the appropriate resolution for the application. In a future post, we will examine an application and use Q1.15 libraries to implement a PID loop! Don't forget to check out the (more) complete source code!

You may also be interested in our article Avoiding Division and Modulo Operations.



© by Jason R. Jones 2016