The meaning of an N-bit word depends entirely on its interpretation, i.e., the representation set and the mapping we choose to use. Each bit of a 16-bit register can be set or cleared; the value of that state depends entirely on what we decide it means.
Qn.q
One way to represent fixed point numbers is called the Q number format or Qn.q.
- Q denotes that the number is in the Q format notation
- n is the total number of bits of the register
- q is the number of bits used to designate the fractional part of the number, i.e., the number of bits to the right of the "binary point"
- Q may be prefixed by U or S to designate an unsigned or signed number, respectively
Examples
UQ16.0 - unsigned number of 16-bits with zero fraction bits meaning integer [0 to 65535]
UQ16.8 - unsigned number of 16-bits with 8 fraction bits leaving 8 integer bits [0 to 255]
SQ16.15 - signed number of 16-bits with 15 fraction bits leaving 0 integer bits [-1 to 1)
Below, each row shows the same bit pattern interpreted under a different format. Between the integer part and the fractional part there is an "imaginary" binary point:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Unsigned | UQ16.0 | 29596 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Unsigned | UQ16.8 | 115.609375 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Fixed | Signed | SQ16.15 | ~0.903198 |
0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | Float | - | - | 15638.86 |
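The fixed-point rows of the table can be reproduced with a few lines of Python. This is a minimal sketch, and the helper name `q_value` is ours, not part of any standard library:

```python
def q_value(raw, n, q, signed):
    """Interpret an n-bit raw integer as a Qn.q fixed-point value."""
    if signed and raw >= 1 << (n - 1):   # two's complement wrap-around
        raw -= 1 << n
    return raw / (1 << q)                # place the binary point q bits from the right

bits = 0b0111001110011100               # the pattern from the table above

print(q_value(bits, 16, 0, signed=False))   # UQ16.0  -> 29596.0
print(q_value(bits, 16, 8, signed=False))   # UQ16.8  -> 115.609375
print(q_value(bits, 16, 15, signed=True))   # SQ16.15 -> ~0.903198
```

The same 16 bits yield three different values purely through the choice of format.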
What do we observe?
Since we have a fixed register size, there is a trade-off between the dynamic range of the numbers we can represent and the precision of those numbers:
- allocating more bits to the fraction (smaller dynamic range) yields higher precision
- allocating more bits to the integer part (larger dynamic range) yields lower precision
Dynamic Range
Dynamic range is the ratio between the largest number and the lowest positive number that can be represented using the given Qn.q format. It can be represented in dB (decibels) using the following formula:
Dynamic Range (DR) = 20 * log10( |max_val| / |min_val| )
Note: in FXP notation, dynamic range depends only on the register size. Let's understand why!
Example
For UQ16.0 we have (we omit the modulus):
max_value = 2^N - 1 = 65535
min_value = 1 (the lowest positive number)
max_value / min_value = 65535
DR = 20 * log10(65535) ~ 96 dB
For UQ16.16 we have (we omit the modulus):
max_value = 1 - 2^(-N)
min_value = 2^(-N)
max_value / min_value = [1 / 2^(-N)] - [2^(-N) / 2^(-N)] = 2^N - 2^0 = 2^N - 1 = 65535
So the DR is the same: 20 * log10(65535) ~ 96 dB
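The calculation above can be sketched in a couple of lines; the function name `dynamic_range_db` is ours:

```python
import math

def dynamic_range_db(n):
    # DR = 20 * log10(max/min); for an N-bit format the ratio is 2^N - 1,
    # regardless of where the binary point sits.
    return 20 * math.log10((1 << n) - 1)

print(round(dynamic_range_db(16), 1))  # ~96.3 dB
```

Changing `q` moves the binary point but never changes the ratio, which is why the dynamic range depends only on the register size.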
Format conversions
We will see that inside a CNN (Convolutional Neural Network) we often need to modify the format (can you guess the reason?), so it's important to understand how these conversions work.
Converting from a smaller to a larger format
Converting a FXP value into a larger format can be done in two ways:
- increasing the integer part:
- extend using the sign bit (for signed formats)
- extend using zeroes (for unsigned formats)
- increasing the fractional part
Increasing the integer part
In this example we consider a signed FXP number => we extend using the sign bit when converting from SQ16.8 to SQ32.8:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | -0.5 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.8 | -0.5 |
In case you're wondering why extending the sign bit works (i.e., preserves the value), the key is the 2's complement representation. Recall that one of the properties of 2's complement is that arithmetic operations use the same hardware for signed and unsigned operands (same HW adder, multiplier, etc.). The sign bit (the 1 at position 15) is simply replicated up to position 31.
Increasing the fractional part
This is straightforward: we simply pad with zeros to the right, here converting from SQ16.8 to SQ32.24:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | -0.5 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.24 | -0.5 |
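Both widening directions can be sketched in one small helper. This is illustrative code, not a library API; the name `widen` is ours:

```python
def widen(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: convert raw Qn_src.q_src bits to a larger Qn_dst.q_dst format."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode source as two's complement
    raw <<= (q_dst - q_src)            # zero-pad the fractional part
    return raw & ((1 << n_dst) - 1)    # re-wrap: sign extension falls out naturally

# -0.5 in SQ16.8 is 0xFF80; its value survives both conversions
print(hex(widen(0xFF80, 16, 8, 32, 8)))    # 0xffffff80  (SQ32.8: sign extended)
print(hex(widen(0xFF80, 16, 8, 32, 24)))   # 0xff800000  (SQ32.24: fraction padded)
```

Decoding as two's complement first and re-wrapping at the end is what makes the sign bits appear in the upper half automatically.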
Converting from a larger to a smaller format
Converting a FXP value into a smaller format can also be done in two ways:
- truncation
- rounding
The best way to explain is to actually see what happens with the numbers:
Truncation example
This is how an SQ32.24 can be truncated to an SQ16.1:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Fixed | Signed | SQ32.24 | 2.99999 |
Shift right by 23 (the difference between the fractional sizes, 24 - 1):
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Sign extend to the target register size:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | Fixed | Signed | SQ16.1 | 2.5 |
We observe that we first shift to the right until the number is represented with the desired precision, then sign-extend the remaining bits to the left.
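The shift-then-reinterpret steps above can be sketched as follows; `truncate` is our own illustrative helper, and Python's `>>` on a negative integer conveniently behaves as an arithmetic shift:

```python
def truncate(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: narrow Qn_src.q_src to Qn_dst.q_dst by truncation."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode as two's complement
    raw >>= (q_src - q_dst)            # arithmetic shift drops fraction bits
    return raw & ((1 << n_dst) - 1)    # re-wrap to the target width

x = 0x02FFFFFF                          # ~2.99999 in SQ32.24
print(truncate(x, 32, 24, 16, 1) / 2)   # 2.5
```

Note that truncation always rounds toward minus infinity, which is exactly why the rounding method below exists.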
Rounding example
Let's see the same example, but this time converted using the rounding method:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Fixed | Signed | SQ32.24 | 2.99999 |
+
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Rounding factor (2^22, i.e., half of the smallest bit that survives the shift) |
=
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Rounded result (~3.24999) |
Shift right by 23:
8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Sign extend to the target register size:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Value |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | Fixed | Signed | SQ16.1 | 3.0 |
We observe that we first add a rounding factor (whose value depends on the number of fractional bits being dropped), then apply the same operations (shift and sign-extend) as above.
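The add-then-shift recipe can be sketched with one more illustrative helper (the name `round_convert` is ours):

```python
def round_convert(raw, n_src, q_src, n_dst, q_dst, signed=True):
    """Sketch: narrow with round-to-nearest by adding half an LSB first."""
    if signed and raw >= 1 << (n_src - 1):
        raw -= 1 << n_src              # decode as two's complement
    shift = q_src - q_dst
    raw += 1 << (shift - 1)            # rounding factor: 2^(shift-1)
    raw >>= shift                      # then the same arithmetic shift
    return raw & ((1 << n_dst) - 1)

x = 0x02FFFFFF                              # ~2.99999 in SQ32.24
print(round_convert(x, 32, 24, 16, 1) / 2)  # 3.0 instead of 2.5
```

The only difference from truncation is the added factor, yet the result is now the nearest representable value rather than the next one below.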
Fixed point math
We will first analyze the most important (and common) operations with FXP format.
Addition
Rule: In order to be able to add two FXP numbers, they must have the same format:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | 83.750000 | 53C0 |
+
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | Fixed | Signed | SQ16.8 | 32.765625 | 20C4 |
=
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec | Hex |
0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Fixed | Signed | SQ16.8 | 116.515625 | 7484 |
We can observe that the above operation is simply normal binary addition. What is special is how we interpret the numbers, and that is something we already discussed.
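Because only the interpretation differs, the addition above really is plain integer addition on the raw words. A minimal sketch (the helper name `add_q` is ours):

```python
def add_q(a_raw, b_raw, n=16):
    """Sketch: add two raw words of the same Qn.q format."""
    return (a_raw + b_raw) & ((1 << n) - 1)   # wrap to the register width

s = add_q(0x53C0, 0x20C4)        # 83.75 + 32.765625, both SQ16.8
print(hex(s), s / 256)           # 0x7484 116.515625
```

The division by 256 (2^8) only happens when we display the value; the hardware never sees it.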
Multiplication
Multiplication doesn't have this limitation. We can multiply any FXP format with any FXP format.
- Rule 1: The total number of bits for the result = the sum of the number of bits of the operands.
- Rule 2: The size of the fractional part for the result = the sum of the fractional sizes of the operands.
Example
Let's see an example of multiplying an SQ16.8 by an SQ16.4. If we apply the two rules, we must obtain a signed number where the number of bits is as follows:
- total number of bits = total_nr_bits_N1 + total_nr_bits_N2 = 16 + 16 = 32
- number of fractional bits = fractional_size_N1 + fractional_size_N2 = 8 + 4 = 12
This means our result will have the format SQ32.12
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ16.8 | 3.75 |
x
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | Fixed | Signed | SQ16.4 | 112.75 |
=
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Point | Signedness | Q | Dec |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fixed | Signed | SQ32.12 | 422.8125 |
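The multiplication above is likewise plain integer multiplication on the raw words; the two rules only tell us how to read the result. A sketch for the positive operands in this example (signed raw words would first need two's complement decoding, omitted here; `mul_q` is our own name):

```python
def mul_q(a_raw, b_raw):
    # Rule 1 & 2: the result's width and fractional size are the sums
    # of the operands' widths and fractional sizes.
    return a_raw * b_raw               # raw SQ16.8 x raw SQ16.4 -> raw SQ32.12

p = mul_q(0x03C0, 0x070C)              # 3.75 (SQ16.8) x 112.75 (SQ16.4)
print(p / (1 << 12))                   # 422.8125, read as SQ32.12
```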
By this point you have probably realized that arithmetic on fixed point numbers involves only integer arithmetic, so the FP unit is not even used! You might even consider integer representation to be just a special case of FXP representation :)
We stop here with our short introduction. I hope you enjoyed this material!
What next?
In the next part we will start with the MAC (Multiply-Accumulate) operation and its importance in Convolutional Neural Networks (CNN). We will present the performance gain of running an inference on a modified AlexNet used for image segmentation, using FP32 versus FXP (various formats) deployed on an NVIDIA Drive PX2 and Jetson TX2. We will also see the limitations which appear when doing this kind of neural network transformation.
We are also preparing material about synchronizing clocks with high precision using (g)PTP over Ethernet or CAN.
But first, I have a question for you:
What do you think works best:
- 1. Training your network using FXP from the start and deploying it as it is, or
- 2. Training it using normal FP32 and converting it at deployment time?
About this newsletter
This is the first of a series of technical newsletters by Edocti. We want to keep it technical and concise. We present things that we encounter during our R&D projects. Our main focus is Industrial Autonomous Driving, robotics and Industrial IoT. We care about things like RTOS (QNX, Integrity, OSEK) and RT programming, Linux (embedded, Yocto, RT-Linux, POSIX, ...), Deep Learning, computer vision and OpenVX.