
Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they capture global image context with self-attention and how they compare with CNNs.

The Vision Transformer (ViT) marks a pivotal development in computer vision (CV), applying the highly successful Transformer architecture, initially designed for natural language processing (NLP), to image-based tasks. Unlike traditional Convolutional Neural Networks (CNNs), which process images with localized filters layer by layer, ViTs divide an image into fixed-size patches, treat them as a sequence of tokens (similar to words in a sentence), and process them with the Transformer's self-attention mechanism. This allows ViTs to capture global context and long-range dependencies within an image more effectively than many CNN architectures, leading to state-of-the-art results on various benchmarks, especially when pre-trained on large datasets such as ImageNet-21k.

How Vision Transformers Work

The core idea behind ViT involves reshaping the image processing paradigm. An input image is first split into a grid of non-overlapping patches. Each patch is flattened into a vector and then linearly projected into an embedding space. To retain spatial information, positional embeddings are added to these patch embeddings. This sequence of vectors, now representing the image patches with their positions, is fed into a standard Transformer encoder, as detailed in the original "An Image is Worth 16x16 Words" paper.
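
The following is a minimal PyTorch sketch of this patch-embedding step. The sizes (16x16 patches, a 768-dimensional embedding, 224x224 input) follow the ViT-Base configuration from the paper, while the class name, the [class] token, and the zero initialization are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project them into an embedding space."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and positional embeddings (illustrative initialization).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                          # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, 768) sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend the [class] token
        return x + self.pos_embed                 # add positional information


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -> 196 patches + 1 [class] token
```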

The Transformer encoder, composed of multiple layers, relies heavily on the self-attention mechanism. This mechanism enables the model to weigh the importance of different patches relative to each other dynamically, allowing it to learn relationships between distant parts of the image. This global receptive field contrasts with the typically local receptive field of CNNs, giving ViTs an advantage in understanding the overall scene context. Resources like The Illustrated Transformer offer intuitive explanations of the underlying Transformer concepts. Frameworks like PyTorch and TensorFlow provide implementations of these components.
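
The encoder stage can be sketched with these standard components. The snippet below assumes the patch-token sequence produced by the previous sketch and uses PyTorch's built-in Transformer encoder with ViT-Base-like hyperparameters (12 layers, 12 heads, 768 dimensions) purely for illustration.

```python
import torch
import torch.nn as nn

# A standard pre-norm Transformer encoder applied to the patch-token sequence.
# Hyperparameters loosely mirror ViT-Base; the exact values here are illustrative.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.randn(2, 197, 768)   # (batch, 196 patches + [class] token, embedding dim)
encoded = encoder(tokens)           # self-attention lets every token attend to every other token
cls_representation = encoded[:, 0]  # the [class] token is typically used for classification
print(cls_representation.shape)     # torch.Size([2, 768])
```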

Relevance and Applications

Vision Transformers have become highly relevant in modern deep learning due to their scalability and impressive performance, particularly with large-scale pre-training. Their ability to model global context makes them suitable for a wide range of CV tasks beyond basic image classification, including:

  • Object Detection: Transformer-based detectors such as RT-DETR use attention over image features to locate objects with high accuracy.
  • Image Segmentation: ViTs can be adapted for dense prediction tasks like segmenting different objects or regions within an image, as seen in models like the Segment Anything Model (SAM).
  • Medical Image Analysis: ViTs help detect subtle patterns indicative of disease in scans such as X-rays or MRIs, potentially improving diagnostic accuracy in AI in Healthcare applications. For instance, they can identify complex tumor shapes or distributions that might be challenging for locally focused models.
  • Autonomous Vehicles: Understanding the entire traffic scene, including distant vehicles, pedestrians, and traffic signals, is crucial for safe navigation. ViTs' global context modeling aids in comprehensive scene understanding for AI in Automotive applications.

ViTs are increasingly integrated into platforms like Ultralytics HUB and libraries such as Hugging Face Transformers, making them accessible for research and deployment. They can also be optimized for Edge AI deployment on devices like NVIDIA Jetson.
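
As an example of this accessibility, a pre-trained ViT can be run for image classification in a few lines with the Hugging Face Transformers library. The sketch below assumes the transformers package is installed and uses the public google/vit-base-patch16-224 checkpoint; the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load a ViT pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k.
model_name = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)

image = Image.open("example.jpg").convert("RGB")  # placeholder path to any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # predicted ImageNet class name
```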

ViT vs. CNNs

While both ViTs and CNNs are foundational architectures in computer vision (see A History of Vision Models), they differ significantly in their approach:

  • Processing: CNNs use convolution operations with sliding kernels, focusing on local patterns and building hierarchical features. ViTs use self-attention across image patches, focusing on global relationships from the start.
  • Inductive Bias: CNNs have strong built-in inductive biases (such as locality and translation equivariance) that are well suited to images, often making them more data-efficient on smaller datasets. ViTs have weaker inductive biases and typically require larger datasets (or sophisticated pre-training strategies like those used by CLIP) to generalize well.
  • Architecture: CNNs consist of convolutional layers, pooling layers, and fully connected layers. ViTs adopt the standard Transformer encoder structure. Hybrid models, which combine convolutional backbones with Transformer heads/necks (e.g., RT-DETR variants), aim to leverage the strengths of both, as sketched below.
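
As a rough illustration of such a hybrid design, the sketch below treats the spatial positions of a CNN feature map as tokens for a Transformer encoder. The backbone choice (an untrained ResNet-50), channel sizes, and layer counts are assumptions made for demonstration and do not correspond to any specific published model.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# CNN backbone extracts a feature map; its spatial positions become Transformer tokens.
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # (B, 2048, 7, 7) for 224x224 input
project = nn.Conv2d(2048, 256, kernel_size=1)                            # reduce channels to the token dimension
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=6
)

x = torch.randn(1, 3, 224, 224)
feats = project(backbone(x))               # (1, 256, 7, 7)
tokens = feats.flatten(2).transpose(1, 2)  # (1, 49, 256) -> 49 spatial tokens
out = encoder(tokens)                      # global self-attention over CNN features
print(out.shape)                           # torch.Size([1, 49, 256])
```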

The choice between ViT and CNN often depends on the specific task, available data, and computational resources. ViTs generally excel when large amounts of training data are available, while CNNs like those in the Ultralytics YOLO family remain highly effective and efficient, particularly for real-time object detection on constrained devices.
