Deep learning at the edge - part two: Semantic Segmentation on Drive PX2 & Jetson TX2
In Part 1 of this series, we established the mathematical foundation of Fixed-Point (FXP) arithmetic, specifically diving into the Qn.q format, dynamic range, and the underlying reasons why reducing bit-width is essential for deploying AI in embedded environments.
Now, it's time to bridge the gap between theory and automotive production. In this article, we will tackle the MAC (Multiply-Accumulate) operation, the beating heart of Convolutional Neural Networks (CNNs), and demonstrate how we applied reduced precision (fixed-point INT8 and half-precision FP16) to a modified AlexNet for semantic image segmentation.
We'll look at the deployment process and the real-world performance gains achieved on the NVIDIA Drive PX2 and Jetson TX2.
The MAC Operation: Why Precision Costs Speed
In any CNN, the vast majority of computational time is spent executing millions of MAC (Multiply-Accumulate) operations during the convolution phases. A MAC operation essentially calculates the dot product of weights and input activations:
Accumulator = Accumulator + (Weight * Activation)

When running in standard FP32 (32-bit floating point), the hardware requires complex logic to handle mantissa alignment, exponent addition, and normalization. As we showed in Part 1, switching to 8-bit fixed-point integers (INT8) lets the processor rely on simple integer arithmetic, while 16-bit half precision (FP16) halves the width, and therefore the cost, of every floating-point operation.
NVIDIA's Pascal architecture (found in both the TX2 and Drive PX2) heavily accelerates these operations. The discrete GPUs on the Drive PX2 utilize specialized dp4a instructions, allowing four 8-bit integer MAC operations to be executed in a single clock cycle, accumulating into a 32-bit integer. This drastically improves Operations Per Second (OPS) while slashing the memory bandwidth required to fetch weights.
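To make this concrete, here is a minimal C++ sketch that emulates what dp4a does: four INT8 multiplies folded into one INT32 accumulator. This is a scalar illustration of the semantics only; on the GPU, the entire dot product executes as a single instruction (exposed in CUDA as the __dp4a intrinsic).

```cpp
#include <cstdint>
#include <cstdio>

// Scalar emulation of dp4a semantics: a four-way INT8 dot product
// accumulated into a 32-bit integer register.
int32_t dp4a_emulated(const int8_t w[4], const int8_t a[4], int32_t acc)
{
    for (int i = 0; i < 4; ++i)
        acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(a[i]);
    return acc;
}

int main()
{
    const int8_t weights[4]     = { 12, -7, 33, 101 };
    const int8_t activations[4] = { 64,  5, -2,  17 };

    // On a Pascal dGPU this whole dot product retires as one instruction.
    printf("acc = %d\n", dp4a_emulated(weights, activations, 0));
    return 0;
}
```

Because the accumulator is 32-bit, thousands of INT8 products can be summed without overflow, which is exactly what a large convolution kernel needs.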
The Challenge: Semantic Segmentation with FCN-AlexNet
For autonomous driving, object detection (bounding boxes) is often not enough. We need Semantic Segmentation—classifying every single pixel in the camera feed into categories like road, pedestrian, vehicle, or sidewalk.
We used a modified Fully Convolutional Network based on AlexNet (FCN-AlexNet). While standard AlexNet ends with dense fully-connected layers for classification, an FCN replaces them with 1x1 convolutions and upsampling (deconvolution) layers to output a dense pixel-wise classification map.
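Since the upsampling stage does the opposite of a strided convolution, its output size follows the inverted convolution formula. A one-line helper (assuming the usual definition, with no output padding) makes the relationship explicit:

```cpp
// Spatial output size of a deconvolution (transposed convolution):
// the inverse of the standard convolution formula, no output padding.
int deconvOutputSize(int in, int stride, int pad, int kernel)
{
    return (in - 1) * stride - 2 * pad + kernel;
}
```

This is what lets the FCN blow a coarse feature map back up to the full resolution of the input frame.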
Running this pixel-heavy network at 30+ FPS in FP32 on embedded devices poses a massive thermal and computational challenge. This is where Post-Training Quantization (PTQ) using NVIDIA TensorRT comes into play.
From FP32 to INT8: The Calibration Process
Let's answer the question we left you with in Part 1: "Should you train in FXP from the start, or train in FP32 and transform at deployment?"
While Quantization-Aware Training (QAT) is gaining traction, the most robust, industry-standard approach for these architectures is Post-Training Quantization. We train the network normally in FP32 (using frameworks like TensorFlow or PyTorch), freeze the weights, and use TensorRT to convert the model to INT8 just before deployment.
However, you cannot simply truncate FP32 values to INT8. You need to map the floating-point dynamic range onto the limited 8-bit range [-128, 127]. If you blindly map the absolute maximum value to 127, a single outlier compresses all the important data into just a few bits, destroying accuracy.
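Here is a minimal sketch of the symmetric mapping scheme TensorRT uses, assuming a calibrated threshold that is pinned to +/-127, with everything beyond it saturating:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric INT8 quantization: the calibrated threshold maps to +/-127.
// Values beyond the threshold saturate; values inside keep full resolution.
int8_t quantize(float x, float threshold)
{
    const float scale = threshold / 127.0f;   // real value per INT8 step
    const int q = static_cast<int>(std::lround(x / scale));
    return static_cast<int8_t>(std::clamp(q, -127, 127));
}

float dequantize(int8_t q, float threshold)
{
    return q * (threshold / 127.0f);
}
```

Pick the threshold as the raw absolute maximum and one outlier inflates the scale so much that typical activations land in a handful of integer steps; calibration exists to find the smaller threshold that sacrifices the outliers and keeps resolution where the data actually lives.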
TensorRT Entropy Calibration
TensorRT solves this using an Entropy Calibrator (based on Kullback-Leibler divergence). You feed the calibrator a representative batch of your dataset (e.g., 500 images of driving scenarios). TensorRT observes the distribution of activations across each layer and calculates the optimal scaling factor that minimizes information loss.
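The quantity being minimized is the KL divergence between the FP32 activation histogram P and the histogram Q produced by a candidate saturation threshold. Here is a minimal sketch of that measure (the histogram collection and the threshold sweep around it are omitted):

```cpp
#include <cmath>
#include <vector>

// KL divergence D(P || Q) between two normalized histograms. The calibrator
// keeps the saturation threshold whose quantized distribution Q diverges
// least from the reference FP32 distribution P.
double klDivergence(const std::vector<double>& p, const std::vector<double>& q)
{
    double d = 0.0;
    for (std::size_t i = 0; i < p.size() && i < q.size(); ++i)
        if (p[i] > 0.0 && q[i] > 0.0)  // real implementations smooth empty bins
            d += p[i] * std::log(p[i] / q[i]);
    return d;
}
```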
Here is a conceptual C++ snippet showing how to configure the TensorRT builder for INT8 calibration on a Drive PX2 (using the legacy builder API from the TensorRT releases that shipped with these platforms):
```cpp
#include "NvInfer.h"

using namespace nvinfer1;

// 1. Create the builder and network definition
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

// 2. Enable INT8 mode
builder->setInt8Mode(true);

// 3. Provide the custom entropy calibrator (inherits from IInt8EntropyCalibrator)
Int8EntropyCalibrator* calibrator = new Int8EntropyCalibrator(calibrationBatchStream, "calibration_table.cache");
builder->setInt8Calibrator(calibrator);

// 4. Build the highly optimized execution engine
ICudaEngine* engine = builder->buildCudaEngine(*network);
```
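The Int8EntropyCalibrator above is user code; TensorRT only defines the IInt8EntropyCalibrator interface it must implement. Below is a minimal sketch of such a class. The BatchStream helper, along with its batchSize(), next(), and deviceBuffer() methods, is a hypothetical stand-in for your own data pipeline that stages calibration images in GPU memory:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include "NvInfer.h"

using namespace nvinfer1;

// BatchStream: hypothetical helper that walks the calibration set and
// exposes each batch as a pre-loaded device (GPU) buffer.
class Int8EntropyCalibrator : public IInt8EntropyCalibrator
{
public:
    Int8EntropyCalibrator(BatchStream& stream, const std::string& cacheFile)
        : mStream(stream), mCacheFile(cacheFile) {}

    int getBatchSize() const override { return mStream.batchSize(); }

    // Hand TensorRT a device pointer to the next calibration batch.
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        if (!mStream.next()) return false;     // calibration set exhausted
        bindings[0] = mStream.deviceBuffer();  // single input binding assumed
        return true;
    }

    // Reuse a cached calibration table so calibration runs only once per model.
    const void* readCalibrationCache(std::size_t& length) override
    {
        std::ifstream in(mCacheFile, std::ios::binary);
        mCache.assign(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, std::size_t length) override
    {
        std::ofstream out(mCacheFile, std::ios::binary);
        out.write(static_cast<const char*>(cache),
                  static_cast<std::streamsize>(length));
    }

private:
    BatchStream&      mStream;
    std::string       mCacheFile;
    std::vector<char> mCache;
};
```

Returning nullptr from readCalibrationCache is what tells TensorRT to run the full calibration pass; on every subsequent build, the cached table is picked up and the sweep over the calibration images is skipped.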
Performance Gains: Drive PX2 and Jetson TX2
Once the INT8 calibration table was generated, we deployed the optimized FCN-AlexNet engine. The Jetson TX2 (Parker SoC) shines brilliantly in FP16 (half precision), offering native 2x throughput compared to FP32. The Drive PX2, with its discrete Pascal GPUs, leverages full INT8 optimization.
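In practice we gate the precision mode on what the silicon actually accelerates, which TensorRT lets you query from the builder. A small sketch (note: setFp16Mode exists in later TensorRT releases; the versions originally shipped with these boards exposed the same switch as setHalf2Mode):

```cpp
#include "NvInfer.h"

using namespace nvinfer1;

// Pick the fastest precision mode the target silicon actually accelerates.
void selectPrecision(IBuilder* builder, IInt8Calibrator* calibrator)
{
    if (builder->platformHasFastInt8() && calibrator)
    {
        builder->setInt8Mode(true);             // Drive PX2 dGPU path (dp4a)
        builder->setInt8Calibrator(calibrator);
    }
    else if (builder->platformHasFastFp16())
    {
        builder->setFp16Mode(true);             // Jetson TX2 path (native FP16)
    }
    // Otherwise TensorRT falls back to a plain FP32 engine.
}
```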
Here are the normalized performance metrics observed during our autonomous driving tests (processing 1280x720 camera frames):
| Platform | Precision | Memory Footprint (Weights) | Inference Latency (ms) | FPS (Approx) | Accuracy Drop (mIoU) |
|---|---|---|---|---|---|
| Jetson TX2 | FP32 | ~240 MB | ~85 ms | 11 FPS | Baseline |
| Jetson TX2 | FP16 | ~120 MB | ~45 ms | 22 FPS | < 0.1% |
| Drive PX2 (dGPU) | FP32 | ~240 MB | ~40 ms | 25 FPS | Baseline |
| Drive PX2 (dGPU) | INT8 | ~60 MB | ~12 ms | 80+ FPS | ~0.8% - 1.2% |
Observations and Limitations
- Latency & Memory: Moving to INT8 reduced the model size by exactly 4x (from 32 bits to 8 bits per weight), drastically reducing memory bus bottlenecks. Latency improved by over 3x on the Drive PX2.
- Accuracy: The Mean Intersection over Union (mIoU) drop was negligible (around 1%). However, during edge-case visual inspections, we noticed slight degradation on very distant, small objects (e.g., pedestrians far down the road). This happens because, at a fixed scale, INT8 lacks the resolution to represent the tiny activation values such objects produce deep in the network.
- The Calibration Set: The quality of the INT8 model is heavily dependent on the calibration dataset. If your calibration set doesn't include night-time driving, the network's dynamic range will be misaligned when driving in the dark, leading to erratic segmentation.
Conclusion
Fixed-point arithmetic is not just a theoretical concept; it is the absolute standard for deploying production-grade AI in the automotive industry. By leveraging TensorRT and hardware like the Drive PX2 or Jetson TX2, we can transform slow, heavy FP32 models into lightning-fast INT8 engines without compromising the safety and accuracy required for autonomous driving.
In our upcoming newsletters, we will dive into systems engineering—specifically, synchronizing clocks with high precision using (g)PTP over Ethernet, a critical component for sensor fusion.
Ready to master Deep Learning at the Edge?
If you found these optimization techniques valuable, take the next step in your career. Learn how to architect, train, and deploy high-performance computer vision models directly on embedded hardware.
Explore the Deep Learning & Computer Vision Course

About Edocti R&D
This newsletter is part of a technical series by Edocti. We share practical insights gathered directly from our engineering trenches. Our primary focus is Industrial Autonomous Driving, Robotics, and Industrial IoT. We specialize in RTOS (QNX, Integrity, OSEK), Linux internals, Deep Learning, and Computer Vision architecture for Tier-1 suppliers.