
Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they capture global image context with self-attention and how they compare to CNNs.

Vision Transformer (ViT) represents a significant shift in the field of computer vision, adapting the Transformer architecture, originally developed for natural language processing, to image recognition tasks. Unlike traditional Convolutional Neural Networks (CNNs), which build up features through local convolutional filters, ViTs break an image into smaller patches and treat these patches as tokens in a sequence, much like words in a sentence. This approach allows ViTs to leverage the Transformer's powerful self-attention mechanism to capture global relationships within an image, leading to state-of-the-art performance on a range of computer vision tasks.

How Vision Transformers Work

At its core, a Vision Transformer processes images by first dividing them into a grid of fixed-size patches. These patches are then flattened and linearly transformed into embeddings, which are essentially vector representations. Positional embeddings are added to these patch embeddings to retain spatial information, crucial for understanding image structure. This sequence of embedded patches is then fed into a standard Transformer encoder.
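To make the patching step concrete, here is a minimal PyTorch sketch of a patch-embedding module, not the exact implementation from any ViT codebase: a strided convolution splits the image into 16×16 patches and projects each one to an embedding vector, and learnable positional embeddings are added. The class name `PatchEmbedding` and the hyperparameters are illustrative defaults that roughly match ViT-Base.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project them to embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection to it.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings retain the spatial order of the patches.
        # (The [CLS] token used by classification ViTs is omitted here for brevity.)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x + self.pos_embed         # add positional information


patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768]) -> 14x14 patches of dimension 768
```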

The Transformer encoder consists of multiple layers of multi-head self-attention and feed-forward networks. The key component is the self-attention mechanism, which allows the model to weigh the importance of each patch relative to every other patch when processing the image. This gives the ViT a view of the global context of the image, capturing long-range dependencies that CNNs focused on local features might miss, and it is a primary strength of Vision Transformers. For a deeper dive into the underlying principles, resources like Jay Alammar's "The Illustrated Transformer" provide excellent visual explanations of the Transformer architecture.
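As a rough sketch of this stage, the patch sequence from the previous example can be passed through a stack of standard encoder layers. The snippet below uses PyTorch's built-in `nn.TransformerEncoderLayer`; the depth, head count, and embedding size are ViT-Base-like values chosen for illustration rather than taken from a specific implementation.

```python
import torch
import torch.nn as nn

embed_dim, num_heads, depth = 768, 12, 12  # illustrative, ViT-Base-like settings

# Each encoder layer pairs multi-head self-attention with a feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim,
    nhead=num_heads,
    dim_feedforward=4 * embed_dim,
    batch_first=True,  # inputs shaped (batch, sequence, embedding)
    norm_first=True,   # pre-norm ordering, as used in ViT-style models
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

# Self-attention lets every patch attend to every other patch, so each output
# position mixes information from the whole image (global context).
patch_sequence = torch.randn(1, 196, embed_dim)  # e.g. output of the patch embedding above
features = encoder(patch_sequence)
print(features.shape)  # torch.Size([1, 196, 768])
```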

Relevance and Applications

Vision Transformers have rapidly gained prominence due to their impressive performance and scalability. Their ability to capture global context and their capacity to benefit from large datasets have made them highly relevant in modern deep learning applications. Key applications of ViTs include:

  • Image Classification: ViTs have achieved top results on image classification benchmarks, often surpassing the performance of traditional CNN-based models. Their architecture is particularly effective when trained on large datasets like ImageNet.
  • Object Detection: Vision Transformers are increasingly used as backbones in object detection frameworks. Models like RT-DETR by Ultralytics leverage transformer components to achieve real-time detection with high accuracy (see the inference sketch after this list).
  • Image Segmentation: ViTs are also effective in image segmentation tasks, enabling precise pixel-level classification for applications like medical image analysis and autonomous driving. For instance, the Segment Anything Model (SAM) utilizes a ViT backbone for its powerful segmentation capabilities.
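As an example of using a transformer-based detector in practice, the snippet below follows the usage pattern shown in the Ultralytics documentation for RT-DETR; the `rtdetr-l.pt` checkpoint name and the sample image URL are taken from those docs and may change between releases.

```python
from ultralytics import RTDETR

# Load a pretrained RT-DETR model (a transformer-based real-time detector).
# The "rtdetr-l.pt" weights are expected to be downloaded automatically.
model = RTDETR("rtdetr-l.pt")

# Run inference on a sample image and inspect the detected bounding boxes.
results = model("https://ultralytics.com/images/bus.jpg")
for result in results:
    print(result.boxes)
```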

Real-world applications span various industries. In healthcare, ViTs aid in medical image analysis for improved diagnostics. In agriculture, they enhance crop monitoring and disease detection. Furthermore, their efficiency and accuracy make them suitable for deployment on edge devices, as explored in guides for NVIDIA Jetson and Raspberry Pi.

Vision Transformers vs. CNNs

While CNNs have long been the dominant architecture in computer vision, Vision Transformers offer a fundamentally different approach. CNNs excel at capturing local patterns through convolutional layers, making them efficient for tasks where local features are crucial. However, they can sometimes struggle with capturing long-range dependencies and global context. ViTs, on the other hand, inherently capture global context through their self-attention mechanisms, providing an advantage in tasks requiring a holistic understanding of the scene.

Despite their strengths, ViTs typically require significantly larger datasets for training compared to CNNs to achieve optimal performance. CNNs can be more computationally efficient for smaller datasets and tasks focused on local feature extraction. The choice between ViTs and CNNs often depends on the specific application, dataset size, and computational resources available. Vision Transformers represent a significant evolution in computer vision, demonstrating the power of attention mechanisms and paving the way for future advancements in the field.
