Glossary

Model Serving

Learn the essentials of model serving—deploy AI models for real-time predictions, scalability, and seamless integration into applications.

Once a Machine Learning (ML) model is trained and validated, the next critical step is making it available to generate predictions on new data. This process is known as Model Serving. It involves deploying a trained model into a production environment, typically behind an API endpoint, allowing applications or other systems to request predictions in real-time. Model serving acts as the bridge between the developed model and its practical application, transforming it from a static file into an active, value-generating service within the broader Machine Learning Lifecycle.
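
In practice, the serving layer is often a small web service that loads the model once and answers prediction requests over HTTP. The sketch below is a minimal illustration, assuming FastAPI, onnxruntime, and Pillow are installed and that model.onnx is a hypothetical exported image classifier expecting a 1x3x224x224 input; a production setup would typically rely on a dedicated serving framework instead.

```python
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
# Load the model once at startup so every request reuses the same session.
session = ort.InferenceSession("model.onnx")  # hypothetical exported model
input_name = session.get_inputs()[0].name


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded image and shape it to the assumed 1x3x224x224 input.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    batch = (np.asarray(image, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None, ...]
    scores = session.run(None, {input_name: batch})[0]
    return {"scores": scores.tolist()}

# Serve with e.g.: uvicorn serve:app --host 0.0.0.0 --port 8000
```

Once running, any client application can send an image to the /predict endpoint and receive predictions in the response, which is exactly the request/response pattern model serving provides.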

Importance Of Model Serving

Model serving is fundamental for operationalizing ML models. Without it, even the most accurate models, like state-of-the-art Ultralytics YOLO object detectors, remain isolated in development environments, unable to impact real-world processes. Effective model serving makes those models available as scalable, low-latency services that applications can call on demand.

Real-World Applications

Model serving enables countless AI-driven features we interact with daily. Here are two examples:

  1. E-commerce Product Recommendations: When you browse an online store, a model serving backend powers the recommendation system. It takes your browsing history or user profile as input and returns personalized product suggestions in real-time.
  2. Medical Diagnosis Assistance: In healthcare, models trained for medical image analysis can be served via an API. Doctors can upload patient scans (like X-rays or MRIs) to the service, which then returns potential anomalies or diagnostic insights, aiding clinical decision-making (a hypothetical client call is sketched after this list). Platforms like Ultralytics HUB facilitate the deployment of such specialized models.
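
In both examples, the client simply posts data to the serving endpoint and consumes the JSON response. The snippet below is a hedged client-side sketch; the endpoint URL, file name, and response fields are illustrative assumptions rather than a real API.

```python
import requests

# Hypothetical endpoint and payload: upload a scan and read back the prediction.
with open("chest_xray.png", "rb") as f:
    response = requests.post(
        "https://api.example.com/v1/predict",
        files={"file": ("chest_xray.png", f, "image/png")},
        timeout=10,
    )
response.raise_for_status()
print(response.json())  # e.g. {"findings": [...], "latency_ms": 42}
```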

Key Components Of Model Serving

Implementing a robust model serving system involves several components:

  • Model Format: The trained model needs to be saved in a format suitable for deployment, such as ONNX, TensorFlow SavedModel, or optimized formats like TensorRT (see the export sketch after this list).
  • Serving Framework: Software like TensorFlow Serving, TorchServe, or NVIDIA Triton Inference Server manages the model lifecycle, handles requests, and performs inference.
  • API Endpoint: An interface (often managed by an API Gateway) exposes the model's prediction capabilities to client applications.
  • Infrastructure: The underlying hardware and software environment, which could be on-premises servers, cloud computing instances, or even specialized edge computing devices.
  • Monitoring: Tools and processes for model monitoring track performance, latency, errors, and potential data drift to ensure the served model remains effective over time.
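
As an illustration of the first two components, converting trained weights into a deployable format can be a single call. The snippet below is a minimal sketch assuming the ultralytics Python package is installed; the weights file name is an example.

```python
from ultralytics import YOLO

# Export trained weights to ONNX so a serving framework can load them.
model = YOLO("yolov8n.pt")               # example weights file
onnx_path = model.export(format="onnx")  # writes and returns the .onnx file path
print(f"Ready to hand off to a serving framework: {onnx_path}")
```

The resulting ONNX file can then be loaded by a serving framework such as NVIDIA Triton Inference Server or wrapped in a custom onnxruntime service like the one sketched earlier.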

Model Deployment Vs. Model Serving

While the terms Model Deployment and Model Serving are closely related, they aren't identical. Model deployment is the broader concept of making a trained model available for use. This can encompass various strategies, including embedding models directly into applications, deploying them onto edge devices for offline inference, or setting up batch processing pipelines that run predictions periodically. You can explore different Model Deployment Options depending on your needs.
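
To make the contrast concrete, a batch-style deployment might look like the sketch below: a script that scores a folder of images on a schedule (for example via cron) rather than answering live requests. The folder layout and weights file are illustrative assumptions.

```python
from pathlib import Path

from ultralytics import YOLO

# Offline batch scoring: run periodically, not behind an API endpoint.
model = YOLO("yolov8n.pt")  # example weights file
for image_path in sorted(Path("incoming_images").glob("*.jpg")):
    result = model.predict(image_path, verbose=False)[0]
    print(f"{image_path.name}: {len(result.boxes)} objects detected")
```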

Model serving specifically refers to deploying a model as a network service, usually accessible via an API, designed for handling on-demand, often real-time, prediction requests. It's a specific type of model deployment focused on providing continuous inference capabilities with considerations for scalability and low latency. For many interactive applications requiring immediate predictions, model serving is the preferred deployment method.
