The meaning of an N-bit word depends entirely on its interpretation, i.e., the representation set and the mapping we choose to use. Each bit of a 16-bit register can be set or cleared; the value of that state depends entirely on what we decide it means.
Qn.q
One way to represent fixed point numbers is called the Q number format or Qn.q.
- Q denotes that the number is in the Q format notation
- n is the total number of bits of the register
- q is the number of bits used to designate the fractional part of the number, i.e., the number of bits to the right of the "binary point"
- Q may be prefixed by U or S to designate an unsigned or signed number, respectively
Examples
UQ16.0 - unsigned number of 16-bits with zero fraction bits meaning integer [0 to 65535]
UQ16.8 - unsigned number of 16-bits with 8 fraction bits leaving 8 integer bits [0 to 255]
SQ16.15 - signed number of 16-bits with 15 fraction bits leaving 0 integer bits [-1 to 1)
Below, each row shows the same bit pattern interpreted under a different format. Between the integer part and the fractional part there is an "imaginary" binary point:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Unsigned | UQ16.0 | 29596 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Unsigned | UQ16.8 | 115.609375 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Signed | SQ16.15 | ~0.903198 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Float | - | - | 15638.86 |
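The fixed-point rows of the table can be reproduced with a few lines of Python. This is a minimal sketch, and the helper name `q_value` is ours, not part of any standard library:

```python
def q_value(raw, n, q, signed):
    """Interpret an n-bit raw integer as a Qn.q fixed-point value."""
    if signed and raw >= 1 << (n - 1):   # two's complement wrap-around
        raw -= 1 << n
    return raw / (1 << q)                # place the binary point q bits from the right

bits = 0b0111001110011100               # the pattern from the table above

print(q_value(bits, 16, 0, signed=False))   # UQ16.0  -> 29596.0
print(q_value(bits, 16, 8, signed=False))   # UQ16.8  -> 115.609375
print(q_value(bits, 16, 15, signed=True))   # SQ16.15 -> ~0.903198
```

The same 16 bits yield three different values purely through the choice of format.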
What do we observe?
Since we have a fixed register size, there is a trade-off between the dynamic range of the numbers we can represent and the precision of those numbers:
- allocating more bits to the fraction (smaller dynamic range) yields higher precision
- allocating more bits to the integer part (larger dynamic range) yields lower precision
Dynamic Range
Dynamic range is the ratio between the largest number and the lowest positive number that can be represented using the given Qn.q format. It can be represented in dB (decibels) using the following formula:
Dynamic Range (DR) = 20 * log10( |max_val| / |min_val| )
Note: in FXP notation, dynamic range depends only on the register size. Let's understand why!
Example
For UQ16.0 we have (we omit the modulus):
max_value = 2^N - 1 = 65535
min_value = 1 (the lowest positive number)
max_value / min_value = 65535
DR = 20 * log10(65535) ~ 96 dB
For UQ16.16 we have (we omit the modulus):
max_value = 1 - 2^(-N)
min_value = 2^(-N)
max_value / min_value = [1 / 2^(-N)] - [2^(-N) / 2^(-N)] = 2^N - 2^0 = 2^N - 1 = 65535
So the DR is the same: 20 * log10(65535) ~ 96 dB
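The calculation above can be sketched in a couple of lines; the function name `dynamic_range_db` is ours:

```python
import math

def dynamic_range_db(n):
    # DR = 20 * log10(max/min); for an N-bit format the ratio is 2^N - 1,
    # regardless of where the binary point sits.
    return 20 * math.log10((1 << n) - 1)

print(round(dynamic_range_db(16), 1))  # ~96.3 dB
```

Changing `q` moves the binary point but never changes the ratio, which is why the dynamic range depends only on the register size.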
Format conversions
We will see that inside a CNN (Convolutional Neural Network) we often need to modify the format (can you guess the reason?), so it's important to understand how these conversions work.
Converting from a smaller to a larger format
Converting a FXP value into a larger format can be done in two ways:
- increasing the integer part:
- extend using the sign bit (for signed formats)
- extend using zeroes (for unsigned formats)
- increasing the fractional part
Increasing the integer part
In this example we consider a signed FXP number => we extend using the sign bit when converting from SQ16.8 to SQ32.8:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | -0.5 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.8 | -0.5 |
In case you're wondering why extending the sign bit works (i.e., preserves the value), the key is the 2's complement representation. Recall that one of the properties of 2's complement is that arithmetic operations use the same hardware for signed and unsigned operands (same HW adder, multiplier, etc.). The sign bit (the 1 at position 15) is simply replicated up to position 31.
Increasing the fractional part
This is straightforward: we simply pad with zeros to the right, here converting from SQ16.8 to SQ32.24:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | -0.5 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.24 | -0.5 |
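Both widening directions can be sketched in one small helper. This is illustrative code, not a library API; the name `widen` is ours:

```python
def widen(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: convert raw Qn_src.q_src bits to a larger Qn_dst.q_dst format."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode source as two's complement
    raw <<= (q_dst - q_src)            # zero-pad the fractional part
    return raw & ((1 << n_dst) - 1)    # re-wrap: sign extension falls out naturally

# -0.5 in SQ16.8 is 0xFF80; its value survives both conversions
print(hex(widen(0xFF80, 16, 8, 32, 8)))    # 0xffffff80  (SQ32.8: sign extended)
print(hex(widen(0xFF80, 16, 8, 32, 24)))   # 0xff800000  (SQ32.24: fraction padded)
```

Decoding as two's complement first and re-wrapping at the end is what makes the sign bits appear in the upper half automatically.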
Converting from a larger to a smaller format
Converting a FXP value into a smaller format can also be done in two ways:
- truncation
- rounding
The best way to explain is to actually see what happens with the numbers:
Truncation example
This is how an SQ32.24 can be truncated to an SQ16.1:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Fixed | Signed | SQ32.24 | 2.99999 |
Shift right by 23 (the difference between the fractional sizes, 24 - 1):
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Sign extend to the target register size:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | Fixed | Signed | SQ16.1 | 2.5 |
We observe that we first shift to the right until the number is represented with the desired precision, then sign-extend the remaining bits to the left.
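The shift-then-reinterpret steps above can be sketched as follows; `truncate` is our own illustrative helper, and Python's `>>` on a negative integer conveniently behaves as an arithmetic shift:

```python
def truncate(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: narrow Qn_src.q_src to Qn_dst.q_dst by truncation."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode as two's complement
    raw >>= (q_src - q_dst)            # arithmetic shift drops fraction bits
    return raw & ((1 << n_dst) - 1)    # re-wrap to the target width

x = 0x02FFFFFF                          # ~2.99999 in SQ32.24
print(truncate(x, 32, 24, 16, 1) / 2)   # 2.5
```

Note that truncation always rounds toward minus infinity, which is exactly why the rounding method below exists.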
Rounding example
Let's see the same example, but this time converted using the rounding method:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Fixed | Signed | SQ32.24 | 2.99999 |
+
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Rounding factor (2^22, i.e., half of the smallest bit that survives the shift) |
=
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Rounded result (~3.24999) |
Shift right by 23:
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Sign extend to the target register size:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | Fixed | Signed | SQ16.1 | 3.0 |
We observe that we first add a rounding factor (whose value depends on the number of fractional bits being dropped), then apply the same operations (shift and sign-extend) as above.
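The add-then-shift recipe can be sketched with one more illustrative helper (the name `round_convert` is ours):

```python
def round_convert(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: narrow with round-to-nearest by adding half an LSB first."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode as two's complement
    shift = q_src - q_dst
    raw += 1 << (shift - 1)            # rounding factor: 2^(shift-1)
    raw >>= shift                      # then the same arithmetic shift
    return raw & ((1 << n_dst) - 1)

x = 0x02FFFFFF                              # ~2.99999 in SQ32.24
print(round_convert(x, 32, 24, 16, 1) / 2)  # 3.0 instead of 2.5
```

The only difference from truncation is the added factor, yet the result is now the nearest representable value rather than the next one below.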
Fixed point math
We will first analyze the most important (and common) operations with FXP format.
Addition
Rule: In order to be able to add two FXP numbers, they must have the same format:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | 83.750000 | 53C0 |
+
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | Fixed | Signed | SQ16.8 | 32.765625 | 20C4 |
=
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Fixed | Signed | SQ16.8 | 116.515625 | 7484 |
We can observe that the above operation is simply normal binary addition. What is special is how we interpret the numbers, and that is something we already discussed.
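Because only the interpretation differs, the addition above really is plain integer addition on the raw words. A minimal sketch (the helper name `add_q` is ours):

```python
def add_q(a_raw, b_raw, n=16):
    """Sketch: add two raw words of the same Qn.q format."""
    return (a_raw + b_raw) & ((1 << n) - 1)   # wrap to the register width

s = add_q(0x53C0, 0x20C4)        # 83.75 + 32.765625, both SQ16.8
print(hex(s), s / 256)           # 0x7484 116.515625
```

The division by 256 (2^8) only happens when we display the value; the hardware never sees it.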
Multiplication
Multiplication doesn't have this limitation. We can multiply any FXP format with any FXP format.
- Rule 1: The total number of bits for the result = the sum of the number of bits of the operands.
- Rule 2: The size of the fractional part for the result = the sum of the fractional sizes of the operands.
Example
Let's see an example of multiplying an SQ16.8 by an SQ16.4. If we apply the two rules, we must obtain a signed number where the number of bits is as follows:
- total number of bits = total_nr_bits_N1 + total_nr_bits_N2 = 16 + 16 = 32
- number of fractional bits = fractional_size_N1 + fractional_size_N2 = 8 + 4 = 12
This means our result will have the format SQ32.12
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | 3.75 |
x
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | Fixed | Signed | SQ16.4 | 112.75 |
=
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.12 | 422.8125 |
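The multiplication above is likewise plain integer multiplication on the raw words; the two rules only tell us how to read the result. A sketch for the positive operands in this example (signed raw words would first need two's complement decoding, omitted here; `mul_q` is our own name):

```python
def mul_q(a_raw, b_raw):
    # Rule 1 & 2: the result's width and fractional size are the sums
    # of the operands' widths and fractional sizes.
    return a_raw * b_raw               # raw SQ16.8 x raw SQ16.4 -> raw SQ32.12

p = mul_q(0x03C0, 0x070C)              # 3.75 (SQ16.8) x 112.75 (SQ16.4)
print(p / (1 << 12))                   # 422.8125, read as SQ32.12
```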
By this point you have probably realized that arithmetic on fixed point numbers involves only integer arithmetic, so the FP unit is not even used! You might even consider integer representation to be just a special case of FXP representation :)
We stop here with our short introduction. I hope you enjoyed this material!
What next?
In the next part we will start with the MAC (Multiply-Accumulate) operation and its importance in Convolutional Neural Networks (CNN). We will present the performance gain of running an inference on a modified AlexNet used for image segmentation, using FP32 versus FXP (various formats) deployed on an NVIDIA Drive PX2 and Jetson TX2. We will also see the limitations which appear when doing this kind of neural network transformation.
We are also preparing material about synchronizing clocks with high precision using (g)PTP over Ethernet or CAN.
But first, I have a question for you:
What do you think works best:
- 1. Training your network using FXP from the start and deploying it as it is, or
- 2. Training it using normal FP32 and converting it at deployment time?
About this newsletter
This is the first of a series of technical newsletters by Edocti. We want to keep it technical and concise. We present things that we encounter during our R&D projects. Our main focus is Industrial Autonomous Driving, robotics and Industrial IoT. We care about things like RTOS (QNX, Integrity, OSEK) and RT programming, Linux (embedded, Yocto, RT-Linux, POSIX, ...), Deep Learning, computer vision and OpenVX.