
Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they capture global image context and when they can outperform CNNs.

A Vision Transformer (ViT) is a type of neural network architecture that applies the highly successful Transformer model, originally designed for natural language processing (NLP), to computer vision (CV) tasks. Introduced by Google researchers in the paper "An Image is Worth 16x16 Words", ViTs represent a significant departure from the dominant Convolutional Neural Network (CNN) architectures. Instead of processing images with sliding filters, a ViT treats an image as a sequence of patches, enabling it to capture global relationships between different parts of an image using the self-attention mechanism.

How Vision Transformers Work

The core idea behind a ViT is to process an image in a way that mimics how Transformers process text. The process involves a few key steps (a minimal code sketch follows the list):

  1. Image Patching: The input image is first split into a grid of fixed-size, non-overlapping patches. For example, a 224x224 pixel image might be divided into 196 patches, each 16x16 pixels.
  2. Patch Embedding: Each patch is flattened into a single vector and linearly projected into the model's embedding dimension to create "patch embeddings." A learnable "positional embedding" is added to each patch embedding to retain spatial information.
  3. Transformer Encoder: This sequence of embeddings is fed into a standard Transformer encoder. Through its self-attention layers, the model learns the relationships between all pairs of patches, allowing it to capture global context across the entire image from the very first layer.
  4. Classification Head: For tasks like image classification, an extra learnable embedding (similar to the [CLS] token in BERT) is added to the sequence. The corresponding output from the Transformer is passed to a final classification layer to produce the prediction.
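
To make these steps concrete, here is a minimal, illustrative sketch in PyTorch. It is not the original ViT implementation: the encoder simply reuses torch.nn.TransformerEncoder, and the dimensions (16x16 patches, 768-dimensional embeddings, 12 layers) loosely follow the ViT-Base/16 configuration.

```python
import torch
import torch.nn as nn


class MinimalViT(nn.Module):
    """A stripped-down Vision Transformer for illustration only."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 224/16 -> 14*14 = 196

        # Steps 1 + 2: a strided convolution splits the image into
        # non-overlapping patches and linearly projects each one.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token and positional embeddings (one per patch + CLS).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Step 3: a standard Transformer encoder over the patch sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Step 4: classification head applied to the [CLS] output.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                          # x: (B, 3, 224, 224)
        x = self.patch_embed(x)                    # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                        # global self-attention
        return self.head(x[:, 0])                  # logits from the [CLS] token


logits = MinimalViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

Note that the strided convolution performs the patch split and the linear projection in a single step, which is how most practical implementations handle patch embedding.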

ViTs vs. CNNs

While both ViTs and CNNs are foundational architectures in computer vision, they differ significantly in their approach:

  • Inductive Bias: CNNs possess strong inductive biases (assumptions about the data) like locality and translation equivariance through their convolution and pooling layers. ViTs have much weaker inductive biases, making them more flexible but also more dependent on learning patterns directly from data.
  • Data Dependency: Due to their weaker biases, ViTs generally require massive datasets (e.g., ImageNet-21k) or extensive pre-training to outperform state-of-the-art CNNs. With smaller datasets, CNNs often generalize better. This is why transfer learning is critical for ViTs; see the fine-tuning sketch after this list.
  • Global vs. Local Context: CNNs build up hierarchical features from local patterns to global ones. In contrast, ViTs can model global interactions between patches from the earliest layers, potentially capturing broader context more effectively for certain tasks.
  • Computational Cost: Training ViTs is computationally intensive, in part because self-attention scales quadratically with the number of patches, and typically requires substantial GPU resources. Frameworks like PyTorch and TensorFlow provide the building blocks for training these models.
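
Because of this data dependency, ViTs are usually fine-tuned from publicly available checkpoints rather than trained from scratch. The snippet below shows one possible way to do this with torchvision's pretrained vit_b_16 (assuming a recent torchvision release); the 10-class head, frozen backbone, and learning rate are illustrative placeholders, not recommended settings.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-Base/16 checkpoint pre-trained on ImageNet-1k.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a hypothetical 10-class task.
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Optionally freeze the backbone and fine-tune only the new head,
# which is often sufficient when the target dataset is small.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```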

Applications and Hybrid Models

ViTs have shown exceptional performance in various applications, especially where understanding global context is key.

  • Medical Image Analysis: ViTs are highly effective for analyzing medical scans like MRIs or histopathology images. For example, in tumor detection, a ViT can identify relationships between distant tissues, helping to classify tumors more accurately than models that focus only on local textures.
  • Autonomous Driving: In self-driving cars, ViTs can analyze complex scenes for object detection and segmentation. By processing the entire scene globally, they can better model the interactions between vehicles, pedestrians, and infrastructure.

The success of ViTs has also inspired hybrid architectures. Models like RT-DETR combine a CNN backbone for efficient feature extraction with a Transformer-based encoder-decoder to model object relationships. This approach aims to get the best of both worlds: the efficiency of CNNs and the global context awareness of Transformers.
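
As a rough sketch of the hybrid idea (not RT-DETR's actual architecture, which also adds positional encodings and a detection decoder), a CNN backbone can produce a feature map whose spatial locations are then treated as tokens by a Transformer encoder:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridBackboneEncoder(nn.Module):
    """Conceptual hybrid: CNN features reused as Transformer tokens."""

    def __init__(self, embed_dim=256, num_heads=8, depth=6):
        super().__init__()
        # CNN backbone: ResNet-50 up to its final feature map (stride 32).
        # Pretrained weights would normally be loaded here.
        resnet = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)  # channel projection

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, x):                          # x: (B, 3, 640, 640)
        feats = self.proj(self.backbone(x))        # (B, 256, 20, 20)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 400, 256) spatial tokens
        return self.encoder(tokens)                # globally contextualized tokens


tokens = HybridBackboneEncoder()(torch.randn(1, 3, 640, 640))
print(tokens.shape)  # torch.Size([1, 400, 256])
```

The CNN rapidly reduces spatial resolution, so the Transformer only attends over a few hundred tokens rather than thousands of raw patches, which is where much of the efficiency gain comes from.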

For many real-time applications, especially on resource-constrained edge devices, highly optimized CNN-based models like the Ultralytics YOLO family (e.g., YOLOv8 and YOLO11) often provide a better balance of speed and accuracy. You can see a detailed comparison between RT-DETR and YOLO11 to understand the trade-offs. The choice between a ViT and a CNN ultimately depends on the specific task, available data, and computational budget.
