Tokenization

Discover the power of tokenization in NLP and ML! Learn how breaking text into tokens enhances AI tasks like sentiment analysis and text generation.

Tokenization is a fundamental preprocessing step in Artificial Intelligence (AI) and Machine Learning (ML), particularly vital in Natural Language Processing (NLP). It involves breaking down sequences of text or other data into smaller, manageable units called tokens. These tokens serve as the basic building blocks that algorithms use to understand and process information, transforming raw input into a format suitable for analysis.

How Tokenization Works

The core idea behind tokenization is segmentation. For text data, this typically means splitting sentences into words, subwords, or even individual characters based on predefined rules or learned patterns. For example, the sentence "Ultralytics YOLOv8 is powerful" might be tokenized into: ["Ultralytics", "YOLOv8", "is", "powerful"]. The specific method chosen depends on the task and the model architecture. Common techniques include splitting by whitespace and punctuation, or using more advanced methods like Byte Pair Encoding (BPE) or WordPiece, which are used in models such as BERT and most modern Large Language Models (LLMs) to handle large vocabularies and unknown words effectively.
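
As a minimal sketch, a rule-based word-and-punctuation tokenizer can be written in a few lines of Python. The `simple_tokenize` helper below is hypothetical and for illustration only; subword methods like BPE or WordPiece instead rely on a vocabulary learned from a training corpus.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens using a regular expression.

    Hypothetical helper for illustration; real systems typically use
    learned subword tokenizers rather than fixed rules.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Ultralytics YOLOv8 is powerful"))
# ['Ultralytics', 'YOLOv8', 'is', 'powerful']
```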

Relevance and Real-World Applications

Tokenization is essential because most ML models require numerical input. By converting text into discrete tokens, we can map these tokens to numerical representations like embeddings (a minimal sketch of this mapping follows the list below), allowing models to learn patterns and relationships within the data. This process underpins numerous AI applications:

  1. Machine Translation: Services like Google Translate tokenize input sentences in the source language into tokens, process these tokens using complex neural networks (often Transformers), and then generate tokens in the target language, which are finally assembled back into sentences. Accurate tokenization ensures that linguistic nuances are captured correctly.
  2. Sentiment Analysis: To determine the sentiment of a customer review like "The service was excellent!", the text is first tokenized (["The", "service", "was", "excellent", "!"]). Each token is then analyzed, often via its embedding, allowing the model to classify the overall sentiment as positive, negative, or neutral. This capability is crucial for businesses analyzing customer feedback.
  3. Vision-Language Models: Models like CLIP or Ultralytics YOLO-World rely on tokenizing text prompts to understand user queries for tasks like zero-shot object detection or image segmentation. The text tokens are linked with visual features learned from images.
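
As promised above, here is a minimal sketch of the token-to-number mapping using PyTorch's `nn.Embedding` layer; the toy vocabulary and the 8-dimensional embedding size are assumptions chosen purely for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary mapping tokens to integer IDs.
vocab = {"<unk>": 0, "the": 1, "service": 2, "was": 3, "excellent": 4, "!": 5}

tokens = ["the", "service", "was", "excellent", "!"]
ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])

# The embedding layer maps each token ID to a dense vector that the
# model updates during training.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([5, 8]) -> one 8-dim vector per token
```

During training, these vectors are adjusted so that tokens appearing in similar contexts end up with similar representations, which is what lets the model learn the patterns described above.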

Tokenization in Computer Vision

While traditionally associated with NLP, the concept extends to Computer Vision (CV). In Vision Transformers (ViT), images are divided into fixed-size patches, which are treated as 'visual tokens'. These tokens are then processed similarly to text tokens in NLP transformers, enabling models to understand spatial hierarchies and context within images.
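
Patch extraction can be sketched directly with tensor operations in PyTorch; the 224x224 input resolution and 16x16 patch size below are common ViT defaults used here as assumptions, not requirements:

```python
import torch

# A dummy batch containing one RGB image (batch, channels, height, width).
image = torch.randn(1, 3, 224, 224)
patch_size = 16

# Extract non-overlapping 16x16 patches along the height and width dimensions.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten each patch into a vector, yielding a sequence of "visual tokens"
# with shape (batch, num_tokens, token_dim).
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(tokens.shape)  # torch.Size([1, 196, 768]) -> 196 tokens of dimension 768
```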

Benefits and Tools

Effective tokenization standardizes input data, simplifies processing for models, and helps manage vocabulary size, especially with subword methods. Libraries like Hugging Face Tokenizers and toolkits like NLTK provide robust implementations. Platforms like Ultralytics HUB often abstract away the complexities of data preprocessing, including tokenization, streamlining the workflow for training models built with frameworks like PyTorch or TensorFlow. Understanding tokenization is key to building and optimizing many modern AI systems.
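
For example, the snippet below loads the pretrained WordPiece tokenizer used by BERT through the Hugging Face transformers package (assuming the package is installed and the model files can be downloaded on first use):

```python
from transformers import AutoTokenizer

# Load the pretrained WordPiece tokenizer used by BERT
# (downloads the vocabulary files on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Ultralytics YOLOv8 is powerful"
tokens = tokenizer.tokenize(text)   # subword strings; rare words are split into '##' pieces
ids = tokenizer(text)["input_ids"]  # integer IDs, with [CLS] and [SEP] special tokens added
print(tokens)
print(ids)
```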
