Glossary

Knowledge Distillation

Discover how Knowledge Distillation compresses AI models for faster inference and efficient edge-device deployment while preserving most of the original accuracy.

Knowledge Distillation is a technique in Machine Learning (ML) where a smaller, compact model (the "student") is trained to replicate the behavior of a larger, more complex model (the "teacher"). The primary goal is to transfer the "knowledge" learned by the large teacher model to the smaller student model, enabling the student to achieve comparable performance while being significantly more efficient in terms of size and computational cost. This is particularly useful for deploying models in resource-constrained environments like mobile devices or edge AI systems.

How Knowledge Distillation Works

The core idea behind Knowledge Distillation is to train the student model not only on the ground-truth labels (hard targets) used to train the teacher, but also on the outputs generated by the teacher model itself. These teacher outputs are typically "soft targets": the class probability distributions produced by the teacher's final layer (e.g., after a Softmax function). Soft targets carry richer information about the relationships between classes than hard labels alone. For instance, a teacher model might predict an image of a truck as 70% truck, 25% car, and 5% bus, giving the student nuanced inter-class information to learn from.

The student's training objective typically combines a standard loss function (comparing the student's predictions to the ground truth) with a distillation loss (comparing the student's softened predictions to the teacher's soft targets, often with both outputs scaled by a temperature parameter). This approach, popularized in the 2015 paper "Distilling the Knowledge in a Neural Network" by Hinton, Vinyals, and Dean, effectively guides the student to mimic the teacher's learned behavior.
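
To make this concrete, the combined objective might look like the following PyTorch sketch. This is a minimal illustration rather than the exact formulation from the paper; the temperature `T` and mixing weight `alpha` are assumed hyperparameters chosen for demonstration:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft-target distillation term.

    T (temperature) and alpha (mixing weight) are illustrative values only.
    """
    # Standard loss against the ground-truth (hard) labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soften both distributions with temperature T; higher T spreads
    # probability mass across classes, exposing inter-class structure.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # KL divergence between student and teacher distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T**2)
    # Weighted combination of the two objectives.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```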

Benefits and Applications

Knowledge Distillation offers several key advantages:

  • Model Compression: It allows the creation of lightweight models that require less memory and storage, which is crucial for model deployment on devices with limited capacity (the sketch after this list illustrates the size gap between a large and a compact model).
  • Faster Inference: Smaller models generally run much faster, enabling real-time inference for applications like object detection using Ultralytics YOLO models on edge platforms. Explore options for deploying computer vision applications on edge AI devices.
  • Reduced Computational Cost: Training and running smaller models consumes less energy and computational resources.
  • Knowledge Transfer: It facilitates transferring complex knowledge learned by large models, potentially trained on massive datasets like ImageNet, to smaller architectures.
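
As a rough illustration of the size gap, the following sketch (assuming the ultralytics Python package is installed) compares a large candidate teacher with a compact candidate student by parameter count:

```python
from ultralytics import YOLO

# Large, accurate model as a candidate teacher; compact model as a candidate
# student suited to edge devices. Weights are downloaded on first use.
teacher = YOLO("yolov8x.pt")
student = YOLO("yolov8n.pt")

def n_params(model):
    # model.model is the underlying torch.nn.Module of the YOLO wrapper.
    return sum(p.numel() for p in model.model.parameters())

print(f"Teacher (YOLOv8x): {n_params(teacher) / 1e6:.1f}M parameters")
print(f"Student (YOLOv8n): {n_params(student) / 1e6:.1f}M parameters")
```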

Real-world applications include:

  1. Edge Computing: Deploying sophisticated computer vision models on devices like smartphones or embedded systems for tasks such as image classification or detection, where computational power and battery life are constraints. A large, accurate model like YOLOv8x could act as a teacher for a smaller student like YOLOv8n; a sketch of such a training step follows this list.
  2. Accelerating Complex Tasks: As highlighted at YOLO Vision 2023, large Foundation Models can be used for demanding tasks like detailed data annotation, and their knowledge distilled into smaller, faster models for efficient deployment, significantly speeding up processes like data labeling.
  3. Natural Language Processing (NLP): Compressing large language models like BERT or GPT into smaller versions for faster text analysis or translation on user devices.
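
For a classification-style setting, a single distillation training step might look like the sketch below. It assumes `teacher` and `student` (logit-producing classifiers), `optimizer` (over the student's parameters), and a `loader` yielding (images, labels) batches are defined elsewhere, and it reuses the `distillation_loss` function from the earlier sketch:

```python
import torch

teacher.eval()   # freeze the teacher; it only supplies soft targets
student.train()

for images, labels in loader:
    with torch.no_grad():
        teacher_logits = teacher(images)  # no gradients through the teacher
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Wrapping the teacher's forward pass in torch.no_grad() ensures gradients flow only through the student, so only the student's weights are updated.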