Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they capture global image context and can outperform CNNs when trained at scale.

A Vision Transformer (ViT) is a type of neural network architecture adapted from the Transformer models originally designed for Natural Language Processing (NLP). Introduced by Google researchers in the paper "An Image is Worth 16x16 Words", ViTs apply the Transformer's self-attention mechanism directly to sequences of image patches, treating image processing as a sequence modeling task. This approach marked a significant shift from the dominance of Convolutional Neural Networks (CNNs) in computer vision (CV).

How Vision Transformers Work

Instead of processing images pixel by pixel using convolutional filters, a ViT first divides an input image into fixed-size, non-overlapping patches. These patches are then flattened into vectors, linearly embedded, and augmented with positional embeddings to retain spatial information (similar to how word positions are encoded in NLP). This sequence of vectors is then fed into a standard Transformer encoder, which uses layers of multi-head self-attention to weigh the importance of different patches relative to each other. The final output from the Transformer encoder is typically passed to a simple classification head (like a Multi-Layer Perceptron) for tasks like image classification. This architecture allows ViTs to model long-range dependencies and global context within an image effectively.
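
The sketch below outlines this pipeline in PyTorch. It is a minimal illustration, not a reproduction of any published checkpoint: the patch size, embedding dimension, layer count, and head count are placeholder values chosen for readability.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Minimal ViT-style classifier with illustrative hyperparameters."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2

        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and linearly projects each one.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token plus positional embeddings to retain spatial order.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard Transformer encoder built from multi-head self-attention layers.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Simple classification head applied to the [CLS] token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.patch_embed(x)                # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, D) sequence of patch vectors
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                    # self-attention over all patches
        return self.head(x[:, 0])              # classify from the [CLS] token


logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```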

Relevance and Applications

Vision Transformers have become highly relevant in modern deep learning due to their scalability and impressive performance, particularly with large-scale pre-training on datasets like ImageNet or even larger proprietary datasets. Their ability to model global context makes them suitable for a wide range of CV tasks beyond basic classification, such as object detection and image segmentation.

ViTs are increasingly integrated into platforms like Ultralytics HUB and libraries such as Hugging Face Transformers, making them accessible for research and deployment using frameworks like PyTorch and TensorFlow. They can also be optimized for Edge AI deployment on devices like NVIDIA Jetson or Google's Edge TPU using tools like TensorRT.
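
As an illustration of that accessibility, the snippet below runs inference with a pre-trained ViT through the Hugging Face Transformers library. The checkpoint name is one publicly available example, and the image path is a placeholder you would replace with your own file.

```python
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load a publicly available ViT checkpoint and its matching preprocessor.
model_id = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# "example.jpg" is a placeholder path; the processor resizes and normalizes it.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Forward pass returns logits over the checkpoint's ImageNet classes.
logits = model(**inputs).logits
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```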

ViTs vs. CNNs

While both ViTs and CNNs are foundational architectures in computer vision (see A History of Vision Models), they differ significantly in their approach:

  • Inductive Bias: CNNs possess strong inductive biases towards locality and translation equivariance through their convolution and pooling layers. ViTs have weaker inductive biases, relying more heavily on learning patterns from data, particularly the relationships between distant parts of an image via self-attention.
  • Data Dependency: ViTs generally require large amounts of training data (or extensive pre-training) to outperform state-of-the-art CNNs. With smaller datasets, CNNs often generalize better due to their built-in biases.
  • Computational Cost: Training ViTs can be computationally intensive, often requiring significant GPU resources. However, inference speed can be competitive, especially for larger models. RT-DETR models, for example, offer real-time performance but may have higher resource needs than comparable CNN-based YOLO models.
  • Global vs. Local Context: CNNs build up hierarchical features from local patterns. ViTs can model global interactions between patches from the earliest layers, potentially capturing broader context more effectively for certain tasks (see the sketch after this list).
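
The following sketch contrasts those two behaviors on toy tensors: a single self-attention layer produces a full patch-to-patch interaction matrix spanning the whole image, while a 3x3 convolution only mixes neighboring positions. All shapes and dimensions are illustrative.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 192)  # one image as 196 patch embeddings (14x14 grid)

# Self-attention: every patch attends to every other patch, even in the first layer.
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)
_, weights = attn(tokens, tokens, tokens)
print(weights.shape)  # torch.Size([1, 196, 196]) -> global patch-to-patch interactions

# A 3x3 convolution, by contrast, mixes information only from adjacent patches.
conv = nn.Conv2d(192, 192, kernel_size=3, padding=1)
local = conv(tokens.transpose(1, 2).reshape(1, 192, 14, 14))
print(local.shape)    # torch.Size([1, 192, 14, 14]) -> each output sees a 3x3 neighborhood
```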

The choice between ViT and CNN often depends on the specific task, available datasets, and computational resources. ViTs generally excel when large amounts of training data are available and global context is paramount. CNNs, like those used as backbones in the Ultralytics YOLO family (e.g., YOLOv8, YOLOv10, YOLO11), remain highly effective and efficient, particularly for real-time object detection on constrained devices. Hybrid architectures combining convolutional features with transformer layers (like in RT-DETR) also represent a promising direction, attempting to leverage the strengths of both approaches. Fine-tuning pre-trained models, whether ViT- or CNN-based, is a common application of transfer learning.
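
As a rough sketch of that fine-tuning workflow, the example below adapts torchvision's pre-trained ViT-Base/16 by freezing the backbone and replacing its classification head. The ten-class target and the freeze-everything strategy are assumptions for illustration, not a prescribed recipe.

```python
import torch.nn as nn
from torchvision.models import ViT_B_16_Weights, vit_b_16

# Start from an ImageNet pre-trained ViT-Base/16.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze the backbone, then swap in a new head sized for the target task
# (num_classes is a placeholder for your dataset's label count).
num_classes = 10
for param in model.parameters():
    param.requires_grad = False
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# Only the new head's parameters are updated during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # ~7.7k parameters for a 10-class head
```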
