
Transformer

Discover how Transformer architectures revolutionize AI, powering breakthroughs in NLP, computer vision, and advanced ML tasks.

Transformers represent a pivotal neural network architecture that has significantly advanced the fields of artificial intelligence (AI) and machine learning (ML), especially in natural language processing (NLP) and increasingly in computer vision (CV). Introduced in the influential paper "Attention Is All You Need", they process sequential data, like text or time series, using a mechanism called self-attention. This allows the model to dynamically weigh the importance of different parts of the input, overcoming key limitations of older architectures like Recurrent Neural Networks (RNNs).

How Transformers Work

The core innovation of Transformers is the self-attention mechanism. Unlike RNNs, which process input sequentially (one element after another) and can struggle with long sequences due to issues like vanishing gradients, Transformers consider all parts of the input sequence simultaneously. This parallel processing significantly speeds up training on modern hardware such as NVIDIA GPUs.
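To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The tensor sizes, random weights, lack of masking, and absence of positional encodings are all simplifying assumptions for illustration, not a production implementation:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection weights."""
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    d_k = q.size(-1)
    # Every position attends to every other position in one parallel matmul.
    scores = q @ k.transpose(-2, -1) / d_k**0.5
    weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 per query
    return weights @ v

# Toy usage: a "sequence" of 4 tokens with 8-dimensional embeddings.
x = torch.randn(1, 4, 8)
w = [torch.randn(8, 8) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # torch.Size([1, 4, 8])
```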

While typical Convolutional Neural Networks (CNNs) focus on local features through fixed-size convolutional filters, the attention mechanism allows Transformers to capture long-range dependencies and contextual relationships across the entire input. This ability to understand global context is crucial for tasks involving complex relationships, whether between words in text or between the image patches used in Vision Transformers (ViTs).
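The patch-based approach used by ViTs can be sketched in a few lines. The snippet below assumes a 224×224 RGB image, a 16×16 patch size, and a 192-dimensional embedding; using a strided convolution is a common way to implement "split into patches and project", though real ViT implementations vary in detail:

```python
import torch
import torch.nn as nn

patch_size, d_model = 16, 192
# A strided convolution turns each 16x16 patch into one d_model-dim token.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)         # (batch, channels, H, W)
tokens = patch_embed(image)                 # (1, 192, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 192): 196 patch tokens
print(tokens.shape)
```

Once flattened into a sequence of patch tokens, the image can be processed by the same self-attention layers used for text.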

Relevance and Impact

Transformers have become the foundation for many state-of-the-art AI models due to their effectiveness in capturing context and handling long sequences. Their parallelizable nature has enabled the training of massive models with billions of parameters, such as GPT-3 and GPT-4 developed by OpenAI, leading to breakthroughs in generative AI. This scalability and performance have made Transformers central to progress in various AI tasks, driving innovation across research and industry. Many popular Transformer models, like BERT, are readily available through platforms like Hugging Face and implemented using frameworks such as PyTorch and TensorFlow, often integrated into MLOps platforms like Ultralytics HUB.
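As a brief illustration of how readily available these models are, the following sketch assumes the Hugging Face transformers library is installed and loads the widely used bert-base-uncased checkpoint with PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

# Download a pretrained BERT tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the Transformer encoder.
inputs = tokenizer("Attention is all you need.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```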

Applications in AI and ML

Transformers are highly versatile and power numerous AI applications, including:

  • Natural language processing: machine translation, text summarization, and question answering, powered by models such as GPT-4 and BERT.
  • Computer vision: image classification and object detection with Vision Transformers (ViTs) and hybrid detectors such as RT-DETR.
  • Generative AI: large-scale text and content generation, as exemplified by models like GPT-3.

Transformer vs. Other Architectures

It's helpful to distinguish Transformers from other common neural network architectures:

  • Transformers vs. RNNs: RNNs process data sequentially, making them suitable for time-series data but prone to forgetting earlier information in long sequences (the vanishing gradient problem). Transformers process sequences in parallel using self-attention, capturing long-range dependencies more effectively and training faster on parallel hardware such as GPUs; a toy contrast is sketched after this list.
  • Transformers vs. CNNs: CNNs excel at identifying local patterns in grid-like data (e.g., pixels in an image) using convolutional filters, and they are highly efficient for many vision tasks like those addressed by Ultralytics YOLO models. Transformers, particularly ViTs, divide images into patches and use self-attention to model relationships between them; this can capture global context better but often requires more data and compute, especially during model training. Hybrid architectures that combine CNN features with Transformer layers aim to leverage the strengths of both, as seen in some RT-DETR variants. The choice often depends on the specific task, dataset size, and available compute resources.
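As a toy illustration of the RNN-versus-Transformer contrast above, the sketch below (with arbitrary dimensions chosen only for illustration) runs the same sequence through PyTorch's built-in RNN, which loops over time steps internally, and through a multi-head attention layer, which handles all positions in one parallel pass:

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 100, 32)  # (batch, seq_len, features)

rnn = nn.RNN(32, 32, batch_first=True)
out_rnn, _ = rnn(seq)  # internally a sequential loop over 100 time steps

attn = nn.MultiheadAttention(32, num_heads=4, batch_first=True)
out_attn, _ = attn(seq, seq, seq)  # one parallel pass over all 100 positions

print(out_rnn.shape, out_attn.shape)  # both (1, 100, 32)
```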