Edocti
Advanced Technical Training for the Software Engineer of Tomorrow

Deep learning at the edge - part one

Deploying machine learning in real-time embedded systems often comes with the constraint of reducing latency, memory footprint and power consumption while achieving the same accuracy. BUT: in certain applications (e.g. functional-safety critical ones) you need strong guarantees that what your neural network infers is the right thing (i.e. a 20 km/h speed limit is not a 50 km/h one!). You might think that in order to achieve good accuracy you need as many fractional bits as possible.

That idea makes perfect sense, but we will show you that it is possible to achieve all of these goals by using the right algorithms and a fixed point representation which uses a smaller number of bits. The memory footprint decreases dramatically, and so does the latency.

This technical newsletter is the first part and introduces fixed point programming in a concise way. In the next part we'll see FXP applied to an image segmentation problem deployed on NVIDIA Drive PX2 and on Jetson TX2.

Fixed point programming - motivation

With fixed point programming you can achieve higher OPS, a smaller memory footprint, lower latency and sometimes even higher precision. You often need to employ these techniques in embedded systems with high computation requirements and low-latency, low-power constraints.

But this helps not only in embedded systems! Having a farm of AWS EC2 instances (e.g. p2.8xlarge) or even DGX-1 machines for end-to-end testing (you usually need hundreds of hours of inference) is not cheap! Cutting computation time and memory usage in half will save you money.

Fixed point (FXP) processing may reach up to 10 times higher performance per Watt (and a 2-3x latency reduction) compared to a floating point (FP) implementation, depending on the application and precision requirements. In some applications FXP processing will even achieve higher accuracy than its FP version. We'll see that in the next edition.

Example: when calculating probabilities we know that

  • All numbers are positive
  • The maximum numerical value is 1

Let's compare the finest resolutions (the smallest representable positive values) and we'll soon understand how we obtained these numbers:

FXP 16-bit (UQ16.16):     0.0000152587890625
FP  16-bit normalized:    0.00006103515625
FP  16-bit subnormal:     ~0.0000000596
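
A quick sanity check of these values (a small sketch of ours, assuming the floating point rows refer to IEEE 754 half precision, FP16, and the fixed point row to a UQ16.16 format):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      /* UQ16.16: 16 fraction bits -> resolution is 2^-16 */
      printf("FXP UQ16.16 resolution: %.16f\n", ldexp(1.0, -16));
      /* FP16 smallest normalized value: 2^-14 */
      printf("FP16 min normal       : %.14f\n", ldexp(1.0, -14));
      /* FP16 smallest subnormal value: 2^-24 */
      printf("FP16 min subnormal    : %.10g\n", ldexp(1.0, -24));
      return 0;
  }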

Fixed point representation

The meaning of an N-bit word depends entirely on its interpretation == the representation set and the mapping we choose to use. A 16-bit register can hold any combination of set and cleared bits; the value of that bit pattern is entirely dependent on what we decide it means.

Qn.q

One way to represent fixed point numbers is called the Q number format or Qn.q.

  • Q denotes that the number is in the Q format notation
  • n is the number of bits of the register (total number of bits)
  • q is the number of bits used to designate the fractional part of the number == the number of bits to the right of the "binary point"
  • Q may be prefixed by U or S to designate an unsigned or a signed number

Examples

UQ16.0 - unsigned number of 16 bits with zero fraction bits, i.e. an integer in [0, 65535]
UQ16.8 - unsigned number of 16 bits with 8 fraction bits, leaving 8 integer bits: range [0, 256)
SQ16.15 - signed number of 16 bits with 15 fraction bits, leaving 0 integer bits: range [-1, 1)
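
As a minimal sketch (the helper names are ours), the range and resolution of any Qn.q format follow directly from n and q:

  #include <math.h>
  #include <stdio.h>

  /* Resolution (value of one LSB) of a Qn.q format: 2^-q */
  static double q_resolution(int q) { return ldexp(1.0, -q); }

  /* Largest / smallest representable values for UQn.q and SQn.q */
  static double uq_max(int n, int q) { return ldexp((double)((1u << n) - 1u), -q); }
  static double sq_max(int n, int q) { return ldexp((double)((1 << (n - 1)) - 1), -q); }
  static double sq_min(int n, int q) { return ldexp((double)-(1 << (n - 1)), -q); }

  int main(void) {
      printf("UQ16.0 : [0, %.0f], step %.0f\n", uq_max(16, 0), q_resolution(0));
      printf("UQ16.8 : [0, %.8f], step %.8f\n", uq_max(16, 8), q_resolution(8));
      printf("SQ16.15: [%.8f, %.8f], step %g\n",
             sq_min(16, 15), sq_max(16, 15), q_resolution(15));
      return 0;
  }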

Below, the same 16-bit pattern (0111 0011 1001 1100) is interpreted under several formats. In each fixed point format the integer bits sit to the left and the fraction bits to the right of an "imaginary" binary point; for signed formats the most significant bit is the sign bit:

Bits 15..0            Point   Signedness   Q         Value
0111 0011 1001 1100   Fixed   Unsigned     UQ16.0    29596
0111 0011 1001 1100   Fixed   Unsigned     UQ16.8    115.609375
0111 0011 1001 1100   Fixed   Signed       SQ16.15   ~0.903198
0111 0011 1001 1100   Float   -            -         15638.86
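
A small sketch (ours) showing how the raw word 0x739C produces the fixed point values in the table, depending only on the format we assign to it:

  #include <stdint.h>
  #include <stdio.h>

  /* Convert a raw register value to a real number, given q fraction bits */
  static double uq_to_double(uint16_t raw, int q) { return (double)raw / (double)(1u << q); }
  static double sq_to_double(int16_t raw, int q)  { return (double)raw / (double)(1u << q); }

  int main(void) {
      uint16_t raw = 0x739C;                          /* 0111 0011 1001 1100 */
      printf("UQ16.0 : %.0f\n", uq_to_double(raw, 0));            /* 29596      */
      printf("UQ16.8 : %f\n",   uq_to_double(raw, 8));            /* 115.609375 */
      printf("SQ16.15: %f\n",   sq_to_double((int16_t)raw, 15));  /* ~0.903198  */
      return 0;
  }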

What do we observe?

Since we have a fixed register size, we have a trade-off between the dynamic range of numbers that we can represent, and the precision of those numbers:

  • minimizing the dynamic range yields higher precision
  • maximizing the dynamic range yields lower precision

Dynamic Range

Dynamic range is the ratio between the largest number and the lowest positive number that can be represented in a given Qn.q format. It can be expressed in dB (decibels) using the following formula:
Dynamic Range (DR) = 20 * log10( |max_val| / |min_val| )
Note: in FXP notation, dynamic range depends only on the register size. Let's understand why!

Example

For Q16.0 (unsigned, N = 16) we have (we omit the modulus):
max_value = 2^N - 1 = 65535
min_value = 1 (the lowest positive number)
max_value / min_value = 65535
DR = 20 * log10(65535) ≈ 96 dB

For Q16.16 (unsigned, N = 16) we have (we omit the modulus):
max_value = 1 - 2^-N
min_value = 2^-N (the lowest positive number)
max_value / min_value = [1 / 2^-N] - [2^-N / 2^-N] = 2^N - 2^0 = 2^N - 1 = 65535
So DR is the same: 20 * log10(65535) ≈ 96 dB
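
A quick numerical check of the 96 dB figure (a sketch, assuming the 16-bit unsigned formats above):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      /* Q16.0 : max = 2^16 - 1, lowest positive = 1      */
      double dr_q16_0  = 20.0 * log10(65535.0 / 1.0);
      /* Q16.16: max = 1 - 2^-16, lowest positive = 2^-16 */
      double dr_q16_16 = 20.0 * log10((1.0 - ldexp(1.0, -16)) / ldexp(1.0, -16));
      printf("DR(Q16.0)  = %.2f dB\n", dr_q16_0);     /* ~96.33 dB */
      printf("DR(Q16.16) = %.2f dB\n", dr_q16_16);    /* ~96.33 dB */
      return 0;
  }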

Format conversions

We will see that inside a CNN (Convolutional Neural Network) we often need to modify the format (can you guess the reason?), so it's important to understand how these conversions work.

Converting from small to large

Converting an FXP value into a larger format can be done in two ways:

  • increasing the integer part:
    • extend using the sign bit (for signed formats)
    • extend using zeroes (for unsigned formats)
  • increasing the fractional part

Increasing the integer part

In this example we consider a signed FXP number => we extend using the sign bit when converting from SQ16.8 to SQ32.8:
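
A minimal sketch in C (the function name is ours): casting to the wider signed type performs the sign extension for us.

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  /* SQ16.8 -> SQ32.8: widen the integer part; the cast sign-extends */
  static int32_t sq16_8_to_sq32_8(int16_t x) { return (int32_t)x; }

  int main(void) {
      int16_t a = (int16_t)(-1.5 * 256);              /* -1.5 in SQ16.8, raw = -384 */
      int32_t b = sq16_8_to_sq32_8(a);
      printf("raw = %" PRId32 ", value = %f\n", b, b / 256.0);    /* -384, -1.500000 */
      return 0;
  }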

Increasing the fractional part

This is straightforward: we simply pad with zeros to the right:
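
And a sketch of growing the fractional part (here SQ16.8 to SQ32.16, formats chosen by us): multiplying by 2^8 appends 8 zero fraction bits on the right.

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  /* SQ16.8 -> SQ32.16: pad 8 zero fraction bits (multiply by 2^8) */
  static int32_t sq16_8_to_sq32_16(int16_t x) { return (int32_t)x * (1 << 8); }

  int main(void) {
      int16_t a = (int16_t)(-1.5 * 256);              /* -1.5 in SQ16.8     */
      int32_t b = sq16_8_to_sq32_16(a);               /* raw becomes -98304 */
      printf("raw = %" PRId32 ", value = %f\n", b, b / 65536.0);  /* -1.500000 */
      return 0;
  }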

Converting from large to small

Converting an FXP value into a smaller format can also be done in two ways:

  • truncation
  • rounding

The best way to explain is to actually see what happens with the numbers:

Truncation example

This is how an SQ32.24 can be truncated to an SQ16.1:

We observe that we first shift the value to the right until it is represented with the desired precision; during the shift, the bits entering from the left are copies of the sign bit (sign extension). The low 16 bits then hold the truncated result.
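
A sketch of truncation in C (the helper and the target format SQ16.8 are our choice for illustration); on the usual two's complement targets the right shift of a signed value is an arithmetic shift, i.e. it sign-extends from the left:

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Truncate SQ32.frac_in to SQ16.frac_out: drop the extra fraction bits with an
   * arithmetic right shift, then keep the low 16 bits. Assumes frac_in >= frac_out
   * and that the value fits into the smaller range. */
  static int16_t sq32_truncate_to_sq16(int32_t x, int frac_in, int frac_out) {
      return (int16_t)(x >> (frac_in - frac_out));
  }

  int main(void) {
      int32_t a = (int32_t)(-1.25 * (1 << 24));       /* -1.25 in SQ32.24 */
      int16_t b = sq32_truncate_to_sq16(a, 24, 8);    /* -> SQ16.8        */
      printf("raw = %" PRId16 ", value = %f\n", b, b / 256.0);    /* -320, -1.250000 */
      return 0;
  }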

Rounding example

Let's see the same example, but this time converted using the rounding method:
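
Round-to-nearest can be sketched by adding half of the new LSB before shifting (again, the helper and formats are our choice):

  #include <stdint.h>
  #include <stdio.h>

  /* Round SQ32.frac_in to SQ16.frac_out: add half of the target LSB, then shift
   * right (arithmetic shift assumed, as on two's complement targets). */
  static int16_t sq32_round_to_sq16(int32_t x, int frac_in, int frac_out) {
      int shift = frac_in - frac_out;
      int32_t half_lsb = (int32_t)1 << (shift - 1);
      return (int16_t)((x + half_lsb) >> shift);
  }

  int main(void) {
      int32_t a = (int32_t)(1.2529 * (1 << 24));      /* ~1.2529 in SQ32.24 */
      printf("truncated: %f\n", (a >> 16) / 256.0);                     /* 1.250000 */
      printf("rounded  : %f\n", sq32_round_to_sq16(a, 24, 8) / 256.0);  /* 1.253906 */
      return 0;
  }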

Fixed point math

We will first analyze the most important (and most common) operations on FXP numbers.

Addition

Rule: In order to be able to add two FXP numbers, they must have the same format:

We can observe that the operation above is simply normal binary addition. The only particularity is how we interpret the numbers, and that is something we have already discussed.
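
As a sketch (the function name is ours), here is what the alignment plus the plain integer addition looks like when the operands start out in different formats, say SQ16.8 and SQ16.4:

  #include <stdint.h>
  #include <stdio.h>

  /* Bring both operands to a common format (SQ32.8, widened to avoid overflow),
   * then add them as plain integers; the result is SQ32.8. */
  static int32_t add_sq16_8_sq16_4(int16_t a_q8, int16_t b_q4) {
      int32_t a = (int32_t)a_q8;            /* already 8 fraction bits       */
      int32_t b = (int32_t)b_q4 * (1 << 4); /* pad 4 fraction bits: Q4 -> Q8 */
      return a + b;
  }

  int main(void) {
      int16_t a = (int16_t)(2.5  * 256);    /* 2.5  in SQ16.8 */
      int16_t b = (int16_t)(1.75 * 16);     /* 1.75 in SQ16.4 */
      printf("sum = %f\n", add_sq16_8_sq16_4(a, b) / 256.0);   /* 4.250000 */
      return 0;
  }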

Multiplication

Multiplication doesn't have this limitation. We can multiply any FXP format with any FXP format.

  • Rule 1: The total number of bits for the result = the sum of the number of bits of the operands.
  • Rule 2: The size of the fractional part for the result = the sum of the fractional sizes of the operands.

Example

Let's see an example of multiplying an SQ16.8 by an SQ16.4. If we apply the two rules we must obtain a signed number whose number of bits is as follows:

  • total number of bits = total_nr_bits_N1 + total_nr_bits_N2 = 16 + 16 = 32
  • number of fractional bits = fractional_size_N1 + fractional_size_N2 = 8 + 4 = 12

This means our result will have the format SQ32.12.
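
A sketch of this multiplication in C (the function name is ours): multiplying the raw integers of an SQ16.8 and an SQ16.4 directly yields the raw SQ32.12 result.

  #include <stdint.h>
  #include <stdio.h>

  /* SQ16.8 * SQ16.4 -> SQ32.12: 16+16 total bits, 8+4 fraction bits */
  static int32_t mul_sq16_8_sq16_4(int16_t a_q8, int16_t b_q4) {
      return (int32_t)a_q8 * (int32_t)b_q4;           /* plain integer multiply */
  }

  int main(void) {
      int16_t a = (int16_t)(-2.5  * 256);             /* -2.5  in SQ16.8 */
      int16_t b = (int16_t)( 3.25 * 16);              /*  3.25 in SQ16.4 */
      int32_t p = mul_sq16_8_sq16_4(a, b);            /* raw SQ32.12     */
      printf("product = %f\n", p / 4096.0);           /* -8.125000       */
      return 0;
  }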

By this point you realize that arithmetic operations on fixed point numbers only involve integer arithmetic => the FP unit is not even used! You might even consider integer representation to be just a special case of FXP representation :)

We stop here with our short introduction. I hope you enjoyed this material!

What next?

In the next part we will start with the MAC (Multiply-Accumulate) operation and its importance in Convolutional Neural Networks (CNN). We will present the performance gain of running an inference on a modified AlexNet used for image segmentation, using FP32 versus FXP (various formats) deployed on an NVIDIA Drive PX2 and Jetson TX2. We will also see the limitations which appear when doing this kind of neural network transformation.

We are also preparing material about synchronizing clocks with high precision using (g)PTP over Ethernet or CAN.

But first, I have a question for you:

What do you think works best:

  1. Training your network using FXP from the start and deploying it as it is, or
  2. Training it using normal FP32 and transforming it at deployment time?

About this newsletter

This is the first of a series of technical newsletters by Edocti. We want to keep it technical and concise. We present things that we encounter during our R&D projects. Our main focus is Industrial Autonomous Driving, robotics and Industrial IoT. We care about things like RTOS (QNX, Integrity, OSEK) and RT programming, Linux (embedded, Yocto, RT-Linux, POSIX, ...), Deep Learning, computer vision and OpenVX.