Sequence-to-Sequence Models

Discover how sequence-to-sequence models transform input sequences into output sequences, powering AI tasks like translation, chatbots, and speech recognition.

Sequence-to-Sequence (Seq2Seq) models are a class of deep learning architectures designed to transform an input sequence into an output sequence, where the lengths of the input and output sequences may differ. Initially developed using Recurrent Neural Networks (RNNs), these models form the basis for many tasks involving sequential data, particularly in Natural Language Processing (NLP). The core idea is to map sequences like sentences, audio clips, or time series data from one domain to another.

How Sequence-to-Sequence Models Work

Seq2Seq models typically consist of two main components: an encoder and a decoder.

  1. Encoder: This part processes the entire input sequence (e.g., a sentence in French) step by step, updating its internal hidden state as it goes. The final hidden state, often called the "context vector" or "thought vector," aims to capture a summary, or essence, of the input sequence. Early Seq2Seq models used RNN variants such as LSTMs for this purpose, as detailed in the original Sequence to Sequence Learning paper.
  2. Decoder: This component takes the final context vector from the encoder and generates the output sequence step by step (e.g., the translated sentence in English). It uses the context vector as its initial state and produces one element of the output sequence at each time step, updating its own hidden state along the way. A minimal code sketch of this structure follows below.

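Assuming PyTorch (mentioned later in this article) and arbitrary layer sizes, a minimal sketch of this encoder-decoder structure might look like the following; the vocabulary sizes, dimensions, and start-token convention are illustrative assumptions rather than part of any particular published model.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Reads the whole input sequence and returns its final hidden state (the context vector)."""

    def __init__(self, input_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(input_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):  # src: (batch, src_len) of token IDs
        _, hidden = self.rnn(self.embed(src))
        return hidden  # (1, batch, hidden_dim)


class Decoder(nn.Module):
    """Generates the output sequence one token at a time, starting from the context vector."""

    def __init__(self, output_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(output_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, output_vocab)

    def forward(self, token, hidden):  # token: (batch, 1), hidden: (1, batch, hidden_dim)
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden  # logits over the output vocabulary


# Toy usage: encode a batch of token IDs, then decode the first output token greedily.
encoder, decoder = Encoder(input_vocab=1000), Decoder(output_vocab=1200)
src = torch.randint(0, 1000, (2, 7))          # two input sequences of length 7
context = encoder(src)
token = torch.zeros(2, 1, dtype=torch.long)   # assumed start-of-sequence token ID (0)
logits, hidden = decoder(token, context)
next_token = logits.argmax(dim=-1)            # shape (2, 1): first predicted output token
```
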
A key innovation that significantly improved Seq2Seq performance, especially for longer sequences, was the Attention Mechanism. Attention lets the decoder look back at the hidden states for every position of the input sequence (not just the final context vector) when generating each output element, weighting their importance dynamically, as proposed by Bahdanau et al.
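
The following is a rough sketch of that idea rather than the exact formulation from the paper: an additive attention module scores every encoder hidden state against the current decoder state and returns a weighted summary. The dimensions and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Scores every encoder hidden state against the current decoder state (Bahdanau-style)."""

    def __init__(self, hidden_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(hidden_dim, hidden_dim)
        self.w_dec = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden_dim); encoder_states: (batch, src_len, hidden_dim)
        energy = torch.tanh(self.w_enc(encoder_states) + self.w_dec(decoder_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (batch, src_len)
        context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)    # weighted summary
        return context, weights


attn = AdditiveAttention()
encoder_states = torch.randn(2, 7, 128)  # all encoder hidden states, not just the last one
decoder_state = torch.randn(2, 128)      # current decoder hidden state
context, weights = attn(decoder_state, encoder_states)  # weights show which inputs matter now
```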

Relevance and Evolution

Seq2Seq models represented a major breakthrough, particularly for tasks where input and output lengths are variable and alignment is complex. They provided a flexible framework for handling diverse sequence transformation problems. While foundational, the original RNN-based Seq2Seq models faced challenges with long-range dependencies. This led to the development of Transformer models, which rely entirely on attention mechanisms and parallel processing, largely replacing RNNs for state-of-the-art performance in many sequence tasks. However, the core encoder-decoder concept remains influential. Frameworks like PyTorch and TensorFlow provide robust tools for building both traditional Seq2Seq and modern Transformer models.
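
For instance, PyTorch ships a ready-made encoder-decoder Transformer module; the hyperparameters and tensor shapes below are placeholder values for illustration (in practice the inputs would come from embedding layers, and the decoder would use a causal mask during training).

```python
import torch
import torch.nn as nn

# torch.nn.Transformer is a complete encoder-decoder stack built entirely on attention.
model = nn.Transformer(
    d_model=256,           # embedding size shared by encoder and decoder
    nhead=8,               # number of attention heads
    num_encoder_layers=3,
    num_decoder_layers=3,
    batch_first=True,
)

src = torch.randn(2, 10, 256)  # already-embedded input sequence (batch, src_len, d_model)
tgt = torch.randn(2, 6, 256)   # already-embedded, shifted output sequence (batch, tgt_len, d_model)
out = model(src, tgt)          # (2, 6, 256): one representation per output position
```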

Applications in AI and ML

Seq2Seq models, including their modern Transformer-based successors, are used in numerous applications:

  • Machine Translation: Translating text from a source language to a target language (e.g., powering services like Google Translate).
  • Text Summarization: Generating shorter summaries from long articles or documents.
  • Chatbots and Question Answering: Generating conversational responses or answers based on input text or questions. Many modern chatbots are built on advanced Transformer-based models such as GPT-4.
  • Speech Recognition: Converting sequences of audio features into sequences of text (transcription).
  • Image Captioning: Generating textual descriptions (sequences of words) for input images. While distinct from object detection tasks performed by models like Ultralytics YOLO, it involves mapping visual input to sequential output. Research at institutions like the Stanford NLP Group often explores these areas.

While Seq2Seq models are primarily associated with NLP, attention mechanisms inspired by them are also finding use in computer vision, for example, within certain components of detection models like RT-DETR or in Vision Transformers. You can explore various models on platforms like Hugging Face.
