Tokenization is a fundamental step in Natural Language Processing (NLP) and Machine Learning (ML) that involves breaking down text into smaller units, known as tokens. These tokens can be words, subwords, characters, or symbols, depending on the specific requirements of the task and the model being used. Because tokens are what ultimately get mapped into the numerical format that machine learning models can process, tokenization is a crucial first step for a wide range of AI applications.
Definition
Tokenization is the process of segmenting a string of text into individual tokens. Think of it as chopping a sentence into pieces. These pieces, or tokens, become the basic units that a computer can process. For example, the sentence "Ultralytics YOLO is fast." could be tokenized into ["Ultralytics", "YOLO", "is", "fast", "."]. The way text is tokenized can significantly affect how well a model understands and processes language. Different tokenization strategies exist, each with its own strengths and weaknesses. Common methods include the following, illustrated in the code sketch after this list:
- Word Tokenization: This is the most straightforward approach, where text is split into individual words, usually based on spaces and punctuation. For instance, "Let's learn AI!" becomes ["Let", "'s", "learn", "AI", "!"].
- Character Tokenization: Here, each character is considered a token. The same sentence, "Let's learn AI!", would be tokenized into ["L", "e", "t", "'", "s", " ", "l", "e", "a", "r", "n", " ", "A", "I", "!"]. This method is useful for languages where words are not clearly separated by spaces or when dealing with out-of-vocabulary words.
- Subword Tokenization: This method strikes a balance between word and character tokenization. It breaks words into smaller units (subwords) based on frequent character sequences. For example, "unbreakable" might be tokenized into ["un", "break", "able"]. This technique is effective at handling rare words and reducing vocabulary size, which is particularly beneficial for models like BERT (Bidirectional Encoder Representations from Transformers) and the GPT (Generative Pre-trained Transformer) family, including GPT-3 and GPT-4.
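The snippet below is a minimal, library-free Python sketch contrasting the three strategies. The regular expression, the greedy longest-match routine, and the tiny subword vocabulary are simplified, hypothetical stand-ins for what production tokenizers do.

```python
import re

sentence = "Let's learn AI!"

# Word tokenization: keep contractions like "'s" and punctuation as separate tokens.
word_tokens = re.findall(r"'\w+|\w+|[^\w\s]", sentence)
print(word_tokens)  # ['Let', "'s", 'learn', 'AI', '!']

# Character tokenization: every character, including spaces, becomes a token.
char_tokens = list(sentence)
print(char_tokens)  # ['L', 'e', 't', "'", 's', ' ', 'l', ...]


# Subword tokenization: greedily match the longest known piece from a toy vocabulary.
def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown characters fall back to single-char tokens
            i += 1
    return tokens


toy_vocab = {"un", "break", "able"}  # hypothetical subword vocabulary
print(subword_tokenize("unbreakable", toy_vocab))  # ['un', 'break', 'able']
```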
Relevance and Applications
Tokenization is a prerequisite for almost all NLP tasks, enabling machines to process and understand human language. Its applications span many domains:
- Sentiment Analysis: In sentiment analysis, tokenization helps break down customer reviews or social media posts into individual words or phrases, which are then analyzed to determine the overall sentiment (positive, negative, or neutral). For example, in analyzing the sentence "This Ultralytics HUB is incredibly user-friendly!", tokenization allows the sentiment analysis model to focus on individual words like "incredibly" and "user-friendly" to gauge positive sentiment.
- Machine Translation: Tokenization is essential for machine translation. Before translating a sentence from one language to another, the sentence is first tokenized. This allows the translation model to process the text word by word or subword by subword, facilitating accurate and context-aware translations. For example, translating "How to train Ultralytics YOLO models" first involves tokenizing it into words or subwords before mapping these tokens to another language.
- Text Generation: Models used for text generation, such as Large Language Models (LLMs), rely heavily on tokenization. When generating text, these models predict the next token in a sequence. Tokenization ensures that the output is constructed from meaningful units, whether words or subwords, leading to coherent and grammatically correct text.
- Search Engines and Information Retrieval: Search engines utilize tokenization to index web pages and process search queries. When you search for "object detection with Ultralytics YOLO", the search engine tokenizes your query into keywords and matches these tokens against the indexed content to retrieve relevant results. Semantic search further refines this process by understanding the meaning of tokens and their context. A toy sketch of this indexing-and-matching flow appears after this list.
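As a rough illustration of the retrieval flow described above, the sketch below builds a tiny inverted index and scores documents by how many query tokens they contain. The documents, the scoring rule, and the helper names are invented for this example; real search engines are far more sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text):
    """A deliberately simple tokenizer: lowercase word tokens only."""
    return re.findall(r"\w+", text.lower())


# Hypothetical indexed documents.
documents = {
    1: "Object detection with Ultralytics YOLO",
    2: "Training image classification models",
}

# Inverted index: token -> set of document ids containing that token.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        index[token].add(doc_id)

# Tokenize the query and score documents by the number of matching tokens.
query = "object detection with Ultralytics YOLO"
scores = defaultdict(int)
for token in tokenize(query):
    for doc_id in index.get(token, set()):
        scores[doc_id] += 1

print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))  # [(1, 5)]
```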
Types of Tokenization
While the basic concept of tokenization is straightforward, various techniques cater to different languages and NLP tasks:
- Whitespace Tokenization: This simple method splits text based on whitespace characters (spaces, tabs, newlines). While easy to implement, it might not handle punctuation effectively and can struggle with languages that do not use spaces to separate words.
- Rule-Based Tokenization: This approach uses predefined rules to handle punctuation, contractions, and other language-specific nuances. For example, rules can be set to separate punctuation marks as individual tokens or to handle contractions like "can't" as two tokens: "ca" and "n't".
- Statistical Tokenization: More advanced techniques utilize statistical models trained on large text corpora to determine token boundaries. These methods, including subword tokenization algorithms like Byte Pair Encoding (BPE) and WordPiece, are particularly effective for handling complex languages and out-of-vocabulary words (see the example after this list).
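For a concrete illustration of statistical subword tokenization, the sketch below uses the Hugging Face transformers package, assuming it is installed and can download the bert-base-uncased WordPiece vocabulary; the exact subword split depends on that learned vocabulary.

```python
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (downloads the vocabulary on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization handles unbreakable words."
tokens = tokenizer.tokenize(text)
print(tokens)
# Rare words are split into subword pieces (something like ['un', '##break', '##able']
# for 'unbreakable'), while common words stay whole; '##' marks a word continuation.

# Tokens are then mapped to the integer ids a model actually consumes.
print(tokenizer.convert_tokens_to_ids(tokens))
```

Byte Pair Encoding works in a similar spirit, but learns its vocabulary bottom-up by repeatedly merging the most frequent pairs of symbols in a training corpus.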
Benefits of Tokenization
Tokenization offers several key advantages in the context of AI and ML:
- Simplifies Textual Data: By breaking down text into smaller, manageable units, tokenization transforms complex, unstructured text data into a format that algorithms can process efficiently.
- Enables Numerical Representation: Tokens can be easily converted into numerical representations, such as integer ids or vectors, which are the standard input for machine learning models (a minimal sketch follows this list). This conversion is essential for models to learn patterns and relationships in text data. Techniques like word embeddings further enhance this representation by capturing semantic meaning.
- Improves Model Performance: Effective tokenization can significantly improve the performance of NLP models. Choosing the right tokenization strategy for a specific task and language can lead to better accuracy and efficiency in tasks like classification, translation, and generation.
- Manages Vocabulary Size: Subword tokenization, in particular, helps in managing vocabulary size. By breaking down words into subword units, it reduces the number of unique tokens a model needs to learn, making models more efficient and capable of handling a wider range of text, including rare or unseen words.
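The following minimal sketch shows one common way tokens become numbers: building a token-to-id vocabulary from a tiny corpus and encoding a sentence as integer ids, with a reserved id for unseen tokens. The corpus and the `<unk>` convention here are illustrative, not any specific library's API.

```python
# Build a token-to-id vocabulary from a tiny corpus, then encode a sentence as integer ids.
corpus = [
    "Ultralytics YOLO is fast",
    "Ultralytics HUB is user friendly",
]

vocab = {"<unk>": 0}  # reserve id 0 for out-of-vocabulary tokens
for sentence in corpus:
    for token in sentence.lower().split():
        vocab.setdefault(token, len(vocab))


def encode(sentence, vocab):
    """Map each token to its id, falling back to <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.lower().split()]


print(vocab)
print(encode("Ultralytics YOLO is user friendly", vocab))  # [1, 2, 3, 6, 7]
```

In practice, these integer ids are then looked up in an embedding table, which is how models learn semantic relationships between tokens.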
In summary, tokenization is a critical preprocessing step in NLP and ML, enabling computers to understand and process textual data. Its effectiveness depends on the chosen technique and its suitability for the specific task and language. Understanding tokenization is fundamental for anyone working with text-based AI applications, from sentiment analysis and large language models to multimodal models like Ultralytics YOLO-World, which uses textual prompts for open-vocabulary object detection.