
TensorRT

Optimize deep learning models with TensorRT for faster, more efficient inference on NVIDIA GPUs. Achieve real-time performance with YOLO and other AI applications.


TensorRT is a high-performance Deep Learning (DL) inference optimizer and runtime library developed by NVIDIA. It is designed specifically to maximize throughput and minimize inference latency for deep learning applications running on NVIDIA GPUs. TensorRT takes trained neural network models from various frameworks and applies numerous optimizations to generate a highly optimized runtime engine for deployment. This process is crucial for deploying models efficiently in production environments, especially where speed and responsiveness are critical.

Key Features and Optimizations

TensorRT achieves significant performance improvements through several sophisticated techniques:

  • Precision Calibration: Reduces model precision from FP32 to lower precisions like FP16 or INT8 (mixed precision or model quantization) with minimal loss in accuracy, leading to faster computation and lower memory usage (see the sketch after this list).
  • Layer and Tensor Fusion: Combines multiple layers or operations into a single kernel (Layer Fusion), reducing memory bandwidth usage and kernel launch overhead.
  • Kernel Auto-Tuning: Selects the best pre-implemented algorithms (kernels) for the target NVIDIA GPU architecture, ensuring optimal performance for the specific hardware.
  • Dynamic Tensor Memory: Minimizes memory footprint by reusing memory allocated for tensors whose lifetime does not overlap.
  • Multi-Stream Execution: Enables parallel processing of multiple input streams.
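
As a rough illustration of how reduced precision is requested at build time, the sketch below uses the TensorRT Python API to enable FP16 and marks where an INT8 calibrator would plug in. The exact flags and calibrator interface vary by TensorRT version, and the calibrator object shown is hypothetical; a fuller end-to-end build appears in the next section.

```python
import tensorrt as trt

# Sketch: request reduced precision on a TensorRT builder configuration.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# FP16: allow TensorRT to run layers in half precision where the GPU supports it.
config.set_flag(trt.BuilderFlag.FP16)

# INT8: additionally requires a calibrator that feeds representative inputs so
# TensorRT can choose per-tensor scaling factors with minimal accuracy loss.
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # hypothetical IInt8EntropyCalibrator2 subclass
```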

How TensorRT Works

The workflow typically involves taking a trained model (e.g., from PyTorch or TensorFlow, often via an intermediate format like ONNX) and feeding it into the TensorRT optimizer. TensorRT parses the model, performs graph optimizations and target-specific optimizations based on the specified precision and target GPU, and finally generates an optimized inference plan, known as a TensorRT engine. This engine file can then be deployed for fast inference.
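
Assuming a model has already been exported to ONNX (the file names below are illustrative), a minimal version of this build step with the TensorRT Python API might look like the following; exact API details differ across TensorRT versions.

```python
import tensorrt as trt

# Sketch: parse an ONNX model and build a serialized TensorRT engine ("plan").
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # illustrative path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# Graph optimizations, kernel auto-tuning, and precision choices are applied here.
serialized_engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(serialized_engine)  # deployable engine file
```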

Relevance in AI and ML

TensorRT is highly relevant for the model deployment phase of the machine learning lifecycle. Its ability to significantly accelerate inference makes it indispensable for applications requiring real-time inference, such as object detection with models like Ultralytics YOLO, image segmentation, and natural language processing. It is a key component in the NVIDIA software stack, alongside tools like CUDA, enabling developers to leverage the full potential of NVIDIA hardware, from powerful data center GPUs to energy-efficient NVIDIA Jetson modules for Edge AI. Ultralytics provides seamless integration, allowing users to export YOLO models to TensorRT format for optimized deployment, often used with platforms like the Triton Inference Server.
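
For example, with the Ultralytics Python package, exporting a YOLO model to a TensorRT engine and running inference with it takes only a few lines; the model weights and image used below are placeholders, and the export requires an NVIDIA GPU with TensorRT installed.

```python
from ultralytics import YOLO

# Export a trained YOLO model to a TensorRT engine.
model = YOLO("yolo11n.pt")  # placeholder model weights
model.export(format="engine", half=True)  # creates e.g. "yolo11n.engine" with FP16 enabled

# Load the exported engine and run accelerated inference.
trt_model = YOLO("yolo11n.engine")
results = trt_model("https://ultralytics.com/images/bus.jpg")  # placeholder image
```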

Real-World Applications

TensorRT is widely used across various industries where fast and efficient AI inference is needed:

  1. Autonomous Vehicles: In self-driving cars (AI in Automotive), TensorRT optimizes perception models (like object detection and lane segmentation) running on embedded NVIDIA DRIVE platforms, ensuring real-time decision-making crucial for safety. Models like RTDETR can be optimized using TensorRT for deployment in such systems (RTDETRv2 vs YOLOv5 Comparison).
  2. Medical Image Analysis: Hospitals and research institutions use TensorRT to accelerate the inference of AI models that analyze medical scans (CT, MRI) for tasks like tumor detection or anomaly identification (AI in Healthcare), enabling faster diagnostics and supporting clinical workflows. This is often part of larger Computer Vision (CV) systems.