Optimize AI models for edge devices with Quantization-Aware Training (QAT), ensuring high accuracy and efficiency in resource-limited environments.
Quantization-Aware Training (QAT) is an advanced model optimization technique that prepares a neural network (NN) for deployment with lower numerical precision. Unlike standard training, which uses 32-bit floating-point numbers (FP32), QAT simulates the effects of 8-bit integer (INT8) computation during training or fine-tuning. By making the model "aware" of the quantization errors it will encounter during inference, QAT lets the model adjust its weights to minimize the loss in accuracy. The result is a compact, efficient model that maintains high performance, making it ideal for deployment on resource-constrained hardware.
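To make the simulated quantization concrete, here is a minimal sketch of the quantize-dequantize ("fake quantization") round trip, assuming simple symmetric per-tensor INT8 quantization; the function name and values are illustrative only:

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int = 0,
                  qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    """Round values to the INT8 grid, then map back to float, reproducing inference-time error."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = torch.randn(6)
scale = x.abs().max().item() / 127  # simple symmetric per-tensor scale
x_sim = fake_quantize(x, scale)
print(x - x_sim)                    # the quantization error the network learns to tolerate
```

Because this round trip is applied inside the forward pass during QAT, the gradients computed in backpropagation already account for the rounding error.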
The QAT process typically begins with a pre-trained FP32 model. "Fake" quantization nodes are inserted into the model's architecture; they mimic the effect of converting floating-point values to lower-precision integers and back. The model is then fine-tuned, and during this phase standard backpropagation lets it adapt to the information loss introduced by quantization, settling on a set of weights that is less sensitive to the reduced precision. Leading deep learning frameworks such as PyTorch and TensorFlow provide tools and APIs for implementing QAT workflows.
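As one possible illustration of this workflow, the following is a minimal sketch using PyTorch's eager-mode QAT API (`torch.ao.quantization`); the toy network, training loop, and loss are placeholders, not a production recipe:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert

class TinyNet(nn.Module):
    """Small illustrative network wrapped with quant/dequant stubs for eager-mode QAT."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where FP32 inputs enter the quantized region
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # marks where outputs return to FP32

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 backend; "qnnpack" targets ARM
prepare_qat(model, inplace=True)                   # inserts fake-quantization observers

# Fine-tune so the weights adapt to the simulated INT8 rounding (placeholder loop and loss).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(3):
    x = torch.randn(8, 3, 32, 32)
    loss = model(x).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
int8_model = convert(model)  # swaps fake-quant modules for real INT8 kernels
```

TensorFlow offers an analogous flow through the TensorFlow Model Optimization Toolkit, where the model is wrapped with quantization-aware layers before fine-tuning.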
QAT is often compared to Post-Training Quantization (PTQ), the other common model quantization method. The key difference lies in when quantization is applied: PTQ converts a fully trained FP32 model to lower precision afterwards, typically using only a small calibration dataset and no additional training, which makes it fast and simple but can cost more accuracy; QAT simulates quantization during training or fine-tuning, which requires extra compute and training data but generally preserves accuracy better, especially for models that are sensitive to reduced precision.
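For contrast with the QAT sketch above, a minimal PTQ sketch with PyTorch's eager-mode API might look as follows; the toy model and calibration loop are again purely illustrative:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

# PTQ starts from an already-trained network in eval mode; no further training occurs.
model = nn.Sequential(QuantStub(), nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), DeQuantStub()).eval()
model.qconfig = get_default_qconfig("fbgemm")
prepare(model, inplace=True)  # attach observers only

# Calibration: run a few representative batches so the observers record activation ranges.
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(8, 3, 32, 32))

int8_model = convert(model)   # quantize weights and activations from the collected statistics
```

The absence of a training loop here is exactly what makes PTQ cheaper than QAT, and also what limits how much accuracy it can recover.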
Quantization-Aware Training is vital for deploying sophisticated AI models in resource-constrained environments where efficiency is key.
QAT is one of several techniques for model deployment optimization and is often combined with others, such as model pruning and knowledge distillation, for maximum efficiency.
Ultralytics supports exporting models to various formats like ONNX, TensorRT, and TFLite, which are compatible with QAT workflows, enabling efficient deployment across diverse hardware from companies like Intel and NVIDIA. You can manage and deploy your QAT-optimized models using platforms like Ultralytics HUB. Evaluating model performance using relevant metrics after QAT is essential to ensure accuracy requirements are met.
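As a rough illustration of this export-and-evaluate step, the snippet below uses the Ultralytics Python API; the model file, dataset name, and flags are assumptions for the example, and note that the `int8=True` option applies post-training quantization at export time, while a QAT-trained model would be exported through its training framework's own tooling:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # pretrained FP32 model (name illustrative)

# Export to formats suited to different hardware targets.
model.export(format="onnx")                                   # ONNX, e.g. for Intel/OpenVINO toolchains
model.export(format="engine", int8=True, data="coco8.yaml")   # TensorRT INT8 engine for NVIDIA GPUs
model.export(format="tflite", int8=True, data="coco8.yaml")   # INT8 TFLite for mobile and edge devices

# Re-validate after quantization to confirm accuracy still meets requirements.
metrics = YOLO("yolo11n.onnx").val(data="coco8.yaml")
print(metrics.box.map)  # mAP50-95
```

Comparing metrics such as mAP before and after quantization is the simplest way to confirm that the precision reduction has not degraded the model beyond acceptable limits.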