Glossary

TensorRT

Optimize deep learning models with TensorRT for faster, more efficient inference on NVIDIA GPUs. Achieve real-time performance for YOLO models and other AI applications.

TensorRT is a high-performance deep learning inference optimizer and runtime library developed by NVIDIA. It accelerates deep learning models on NVIDIA Graphics Processing Units (GPUs) by applying various optimization techniques. The primary goal of TensorRT is to achieve the lowest possible inference latency and highest throughput for models deployed in production environments, making it crucial for real-time inference applications.

How TensorRT Works

TensorRT takes a trained neural network, often exported from frameworks like PyTorch or TensorFlow, and optimizes it specifically for the target NVIDIA GPU. Key optimization steps include:

  • Graph Optimization: Fusing layers and eliminating redundant operations to create a more efficient computation graph.
  • Precision Calibration: Reducing the numerical precision of model weights (e.g., from FP32 to FP16 or INT8) with minimal impact on accuracy, which significantly speeds up calculations and reduces memory usage.
  • Kernel Auto-Tuning: Selecting the best pre-implemented algorithms (kernels) from NVIDIA's libraries (cuDNN, cuBLAS) for the specific model layers and target GPU.
  • Dynamic Tensor Memory: Minimizing memory footprint by reusing memory allocated for tensors.

These optimizations result in a highly efficient runtime inference engine tailored for the specific model and hardware.
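
As an illustration, the build step can be driven from TensorRT's Python API. The following is a minimal sketch, assuming TensorRT 8.x and a model already exported to a placeholder file model.onnx; NVIDIA's trtexec CLI performs the same steps from the command line:

    import tensorrt as trt

    # Parse an ONNX model and build an optimized, serialized TensorRT engine.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:  # placeholder path for an exported model
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # request FP16 where the target GPU supports it

    # Layer fusion, kernel auto-tuning, and memory planning all happen inside this call.
    engine_bytes = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(engine_bytes)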

Relevance to Ultralytics

TensorRT is a key deployment target for Ultralytics YOLO models. Users can export their trained Ultralytics YOLO models to the TensorRT format to achieve significant speedups on NVIDIA hardware, including edge devices like NVIDIA Jetson, enabling the real-time applications described below. Model comparison pages, such as the YOLOv5 vs RT-DETR comparison, often showcase inference speeds achieved using TensorRT optimization. Ultralytics also provides guides for integrating with NVIDIA platforms, like the DeepStream on NVIDIA Jetson guide.
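
In practice, the export is a single call. Below is a minimal sketch using the Ultralytics Python API; the yolov8n.pt weights, the half=True FP16 flag, and the sample image URL are illustrative choices, and running it requires an NVIDIA GPU with TensorRT installed:

    from ultralytics import YOLO

    # Export a trained model to a serialized TensorRT engine
    model = YOLO("yolov8n.pt")
    model.export(format="engine", half=True)  # half=True requests FP16 precision

    # The exported engine is used through the same API as the original model
    trt_model = YOLO("yolov8n.engine")
    results = trt_model("https://ultralytics.com/images/bus.jpg")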

Real-World Applications

TensorRT is widely used where fast and efficient inference on NVIDIA hardware is critical:

  1. Autonomous Vehicles: Self-driving cars rely on processing vast amounts of sensor data in real-time. TensorRT accelerates models for object detection, segmentation, and path planning, enabling quick decision-making essential for safety. This is a core component of AI in automotive solutions.
  2. Video Analytics and Smart Cities: Processing multiple high-resolution video streams for tasks like traffic monitoring, crowd analysis, or security surveillance requires immense computational power. TensorRT optimizes models like Ultralytics YOLOv8 to handle these demanding workloads efficiently on servers or edge devices, powering AI solutions for smart cities.
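
As a sketch of the second scenario, an exported engine can be run over a video source frame by frame; traffic.mp4 is a placeholder file, and stream=True keeps memory usage flat by yielding one result at a time:

    from ultralytics import YOLO

    # Frame-by-frame inference over a video stream with a TensorRT engine
    model = YOLO("yolov8n.engine")  # engine exported as shown earlier
    for result in model("traffic.mp4", stream=True):  # placeholder video source
        print(f"{len(result.boxes)} objects detected in this frame")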

TensorRT vs. Similar Terms

  • ONNX (Open Neural Network Exchange): ONNX is an open format for representing deep learning models. While TensorRT can import models from the ONNX format, ONNX itself is hardware-agnostic, whereas TensorRT is specifically an optimizer and runtime for NVIDIA GPUs. Ultralytics models can be exported to ONNX (see the sketch after this list).
  • OpenVINO: Similar to TensorRT, OpenVINO is an inference optimization toolkit, but it is developed by Intel and primarily targets Intel hardware (CPUs, iGPUs, VPUs). Learn more about Ultralytics OpenVINO integration.
  • PyTorch / TensorFlow: These are deep learning frameworks used primarily for training models. TensorRT optimizes models after they have been trained using these frameworks, preparing them for efficient model deployment.
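
A brief sketch of that interchange path, assuming the same illustrative yolov8n.pt weights; the trtexec command shown in the comment is NVIDIA's standard CLI for compiling an ONNX file into an engine:

    from ultralytics import YOLO

    # Export to the hardware-agnostic ONNX format
    model = YOLO("yolov8n.pt")
    onnx_path = model.export(format="onnx")  # export() returns the saved file's path
    print(f"ONNX model saved to {onnx_path}")

    # The ONNX file can then be compiled for a specific NVIDIA GPU, e.g.:
    #   trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --fp16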