The meaning of an Nbit word depends entirely on its interpretation == the representation set and the mapping we choose to use. A 16bit register has the option of setting or clearing one of its bits. The value of this state is entirely dependent on what we decide it means.
Qn.q
One way to represent fixed point numbers is called the Q number format or Qn.q.
 Q denotes that the number is in the Q format notation
 n is the number of bits of the regist is the number of bits of the register (total number of bits)
 q is the number of bits used to designate the fractional part of the number == the number of bits to the right of the "binary point"
 Q may be prefixed by U or S to designate a signed or unsigned number
Examples
UQ16.0  unsigned number of 16bits with zero fraction bits meaning integer [0 to 65535]
UQ16.8  unsigned number of 16bits with 8 fraction bits leaving 8 integer bits [0 to 255]
SQ16.15  signed number of 16bits with 15 fraction bits leaving 0 integer bits [1 to 1)
Bellow with green we denote integer bits, with dark blue the fraction bits and with red("ish"), the sign bit. Between the green part and the blue part there is an "imaginary" binary point:
15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Value 
0  1  1  1  0  0  1  1  1  0  0  1  1  1  0  0  Fixed  Unsigned  UQ16.0  29596 
0  1  1  1  0  0  1  1  1  0  0  1  1  1  0  0  Fixed  Unsigned  UQ16.8  115.609375 
0  1  1  1  0  0  1  1  1  0  0  1  1  1  0  0  Fixed  Signed  SQ16.15  ~0.903198 
0  1  1  1  0  0  1  1  1  0  0  1  1  1  0  0  Float      15638.86 
What we observe?
Since we have a fixed register size, we have a tradeoff between the dynamic range of numbers that we can represent, and the precision of those numbers:
 minimizing the dynamic range yields higher precision
 minimizing the dynamic range yields higher precision
Dynamic Range
Dynamic range is the ratio between the largest number and the lowest positive number that can be represented using the given Qn.q format. It can be represented in dB (decibels) using the following formula:
Dynamic Range (DR) = 20 * log_{10}( max_val / min_val )
Note: in FXP notation, dynamic range depends only on the register size. Let's understand why!
Example
For Q16 .0 we have (we omit modulus):
max_value = 2 ^{N1}  1 = 65535
min_value = 1 (the lowest positive number)
max_value / min_value = 65535
DR = 20 * log_{10} 65535 ~ 96dB
For Q16 .16 we have (we omit modulus):
max_value = 1  2 ^{N1}
min_value = 2 ^{}(N1)
max_value / min_value = [1 / 2^{(N1)}]  [2^{N1} / 2^{(N1)}] = 2^{N1}  2^{0} = 2^{N1}  1 = 65535
So DR is the same: 20 * log_{10} 65535 ~ 96dB
Format conversions
We will see that inside a CNN (Convolutional Neural Network) we often need to modify the format (can you guess the reason?), so it's important to understand how these conversions work.
Converting from low to large
Converting a FXP value into a larger format can be done in two ways:
 increasing the integer part:
 extend using the sign bit (for signed formats)
 extend using zeroes (for unsigned formats)
 increasing the fractional part
Increasing the integer part
In this example we consider a signed FXP number => we extend using the sign bit when converting from SQ16.8 to SQ32.8:
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Value 
                                1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  Fixed  Signed  SQ16.8  0.5 
1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  Fixed  Signed  Q  Value 
In case you're wondering why extending the sign bit works (<=> gives the same value), the key is 2's complement representation. Recall that one of the properties of 2's complement is that arithmetic operations are identical (same HW adder, multiplier, etc). The sign bit (1 from position 15) is replicated up to position 31.
Increasing the fractional part
This is straight forward: we simply pad with zeros to the right:
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Value 
                                1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  Fixed  Signed  SQ16.8  0.5 
1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  Fixed  Signed  Q  Value 
Converting from large to low
Converting a FXP value into a smaller format can be done also in two ways:
The best way to explain is to actually see what happens with the numbers:
Truncation example
This is how an SQ32.24 can be truncated to an SQ16.1:
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Value 
0  0  0  0  0  0  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  Fixed  Signed  SQ32.24  2.99999 



















































 →Shift right by 23 
 8  7  6  5  4  3  2  1  0 


























 0  0  0  0  0  0  1  0  1 























































 15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0 



















 0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  1  Fixed  Signed  SQ16.1  2.5 















 Sign extend 













We observe that first we shift to the right until we represent the number with the desired precision. Then we signextend the remaining bits to the left.
Rounding example
Let's see the same example, but this time converted using the rounding method:
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Value 
0  0  0  0  0  0  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  Fixed  Signed  SQ32.24  2.99999 
+ 




31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0 




0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  Add Rounding Factor 




































31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0 




0  0  0  0  0  0  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  Rounded result 



















































 →Shift right by 23 
 8  7  6  5  4  3  2  1  0 


























 0  0  0  0  0  0  1  1  0 























































 15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0 



















 0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  Fixed  Signed  SQ16.1  3.0 















 Sign extend 













We observe that we first add a rounding factor (and its value depends on the fractional size) and then we apply the same operation (shift and signextend) as above.
Fixed point math
We will first analyze the most important (and common) operations with FXP format.
Addition
Rule: In order to be able to add two FXP numbers, they must have the same format:
15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec  Hex 
0  1  0  1  0  0  1  1  1  1  0  0  0  0  0  0  Fixed  Signed  SQ16.8  83.750000  53C0 
+
15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec  Hex 
0  0  1  0  0  0  0  0  1  1  0  0  0  1  0  0  Fixed  Signed  SQ16.8  32.765625  20C4 
=
15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec  Hex 
0  1  1  1  0  1  0  0  1  0  0  0  0  1  0  0  Fixed  Signed  SQ16.8  116.515625  7484 
We can observe that the above operation is simply normal binary addition. What is particular is how we interpret the numbers, but this is something we already discussed.
Multiplication
Multiplication doesn't have this limitation. We can multiply any FXP format with any FXP format.
 Rule 1: The total number of bits for the result = the sum of the number of bits of the operands.
 Rule 2: The size of the fractional part for the result = the sum of the fractional sizes of the operands.
Example
Let's see an example for multiplying an SQ 16.8 with an SQ 16.4. If we apply the two rules we must obtain a signed number where the number of bits is as follows:
 total number of bits = total_nr_bits_N1 + total_nr_bits_N2 = 16 + 16 = 32
 number of fractional bit = fractional_size_N1 + fractional_size_N2 = 8 + 4 = 12
This means our result will have the format SQ32.12
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec 
                                0  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0  Fixed  Signed  SQ16.8  3.75 
x
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec 
                                0  0  0  0  0  1  1  1  0  0  0  0  1  1  0  0  Fixed  Signed  SQ16.4  112.75 
=
31  30  29  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10  9  8  7  6  5  4  3  2  1  0  Point  Signedness  Q  Dec 
0  0  0  0  0  0  0  0  0  0  0  1  1  0  1  0  0  1  1  0  1  1  0  1  0  0  0  0  0  0  0  0  Fixed  Signed  SQ32.12  422.8125 
By this point you realize that arithmetic operations for fixed point numbers only imply integer arithmetic => FP unit is not even used! You might even consider that integer representation is just a special case of FXP representation :)
We stop here with our short introduction. I hope you enjoyed this material!
What next?
In the next part we will start with the MAC (MultiplyAccumulate) operation and its importance in Convolutional Neural Networks (CNN). We will present the performance gain of running an inference on a modified AlexNet used for image segmentation, using FP32 versus FXP (various formats) deployed on an NVIDIA Drive PX2 and Jetson TX2. We will also see the limitations which appear when doing this kind of neural network transformation.
We are also preparing a material about synchronizing clocks with high precision using (g)PTP over Ethernet or CAN.
But first, I have a question for you:
What do you think works best:
 1. Training your network using FXP from the start and deploy it as it is
 2. Or training it using normal FP32 and transform it at deployment time?
About this newsletter
This is the first of a series of technical newsletters by Edocti. We want to keep it technical and concise. We present things that we encounter during our R&D projects. Our main focus is Industrial Autonomous Driving, roboticsand Industrial IoT. We care about things like RTOS (QNX, Integrity, OSEK) and RT programming, Linux (embedded, Yocto, RTLinux, POSIX, ...), Deep Learning, computer vision and OpenVX.