Glossary

Model Serving

Learn the essentials of model serving—deploy AI models for real-time predictions, scalability, and seamless integration into applications.

Once a Machine Learning (ML) model is trained and validated, the next critical step is making it available to generate predictions on new data. This process is known as Model Serving. It involves deploying a trained model into a production environment, typically behind an API endpoint, allowing applications or other systems to request predictions in real-time. Model serving acts as the bridge between the developed model and its practical application, transforming it from a static file into an active, value-generating service within the broader Machine Learning Lifecycle.
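
In practice, the serving layer is often a small web service that loads the model once and answers prediction requests over HTTP. The sketch below is a minimal illustration, assuming FastAPI, onnxruntime, and Pillow are installed and that model.onnx is a hypothetical exported image classifier expecting a 1x3x224x224 input; a production setup would typically rely on a dedicated serving framework instead.

```python
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
# Load the model once at startup so every request reuses the same session.
session = ort.InferenceSession("model.onnx")  # hypothetical exported model
input_name = session.get_inputs()[0].name


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded image and shape it to the assumed 1x3x224x224 input.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    batch = (np.asarray(image, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None, ...]
    scores = session.run(None, {input_name: batch})[0]
    return {"scores": scores.tolist()}

# Serve with e.g.: uvicorn serve:app --host 0.0.0.0 --port 8000
```

Once running, any client application can send an image to the /predict endpoint and receive predictions in the response, which is exactly the request/response pattern model serving provides.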

Importance Of Model Serving

Model serving is fundamental for operationalizing ML models. Without it, even the most accurate models, like state-of-the-art Ultralytics YOLO object detectors, remain isolated in development environments, unable to impact real-world processes. Effective model serving makes those models available as scalable, low-latency services that applications can call on demand.

Real-World Applications

Model serving enables countless AI-driven features we interact with daily. Here are two examples:

  1. E-commerce Product Recommendations: When you browse an online store, a model serving backend powers the recommendation system. It takes your browsing history or user profile as input and returns personalized product suggestions in real-time.
  2. Medical Diagnosis Assistance: In healthcare, models trained for medical image analysis can be served via an API. Doctors can upload patient scans (like X-rays or MRIs) to the service, which then returns potential anomalies or diagnostic insights, aiding clinical decision-making (a hypothetical client call is sketched after this list). Platforms like Ultralytics HUB facilitate the deployment of such specialized models.
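
In both examples, the client simply posts data to the serving endpoint and consumes the JSON response. The snippet below is a hedged client-side sketch; the endpoint URL, file name, and response fields are illustrative assumptions rather than a real API.

```python
import requests

# Hypothetical endpoint and payload: upload a scan and read back the prediction.
with open("chest_xray.png", "rb") as f:
    response = requests.post(
        "https://api.example.com/v1/predict",
        files={"file": ("chest_xray.png", f, "image/png")},
        timeout=10,
    )
response.raise_for_status()
print(response.json())  # e.g. {"findings": [...], "latency_ms": 42}
```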

Key Components Of Model Serving

Implementing a robust model serving system involves several components:

  • Model Format: The trained model needs to be saved in a format suitable for deployment, such as ONNX, TensorFlow SavedModel, or optimized formats like TensorRT (see the export sketch after this list).
  • Serving Framework: Software like TensorFlow Serving, TorchServe, or NVIDIA Triton Inference Server manages the model lifecycle, handles requests, and performs inference.
  • API Endpoint: An interface (often managed by an API Gateway) exposes the model's prediction capabilities to client applications.
  • Infrastructure: The underlying hardware and software environment, which could be on-premises servers, cloud computing instances, or even specialized edge computing devices.
  • Monitoring: Tools and processes for model monitoring track performance, latency, errors, and potential data drift to ensure the served model remains effective over time.
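
As an illustration of the first two components, converting trained weights into a deployable format can be a single call. The snippet below is a minimal sketch assuming the ultralytics Python package is installed; the weights file name is an example.

```python
from ultralytics import YOLO

# Export trained weights to ONNX so a serving framework can load them.
model = YOLO("yolov8n.pt")               # example weights file
onnx_path = model.export(format="onnx")  # writes and returns the .onnx file path
print(f"Ready to hand off to a serving framework: {onnx_path}")
```

The resulting ONNX file can then be loaded by a serving framework such as NVIDIA Triton Inference Server or wrapped in a custom onnxruntime service like the one sketched earlier.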

Model Deployment Vs. Model Serving

While the terms Model Deployment and Model Serving are closely related, they aren't identical. Model deployment is the broader concept of making a trained model available for use. This can encompass various strategies, including embedding models directly into applications, deploying them onto edge devices for offline inference, or setting up batch processing pipelines that run predictions periodically. You can explore different Model Deployment Options depending on your needs.
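
To make the contrast concrete, a batch-style deployment might look like the sketch below: a script that scores a folder of images on a schedule (for example via cron) rather than answering live requests. The folder layout and weights file are illustrative assumptions.

```python
from pathlib import Path

from ultralytics import YOLO

# Offline batch scoring: run periodically, not behind an API endpoint.
model = YOLO("yolov8n.pt")  # example weights file
for image_path in sorted(Path("incoming_images").glob("*.jpg")):
    result = model.predict(image_path, verbose=False)[0]
    print(f"{image_path.name}: {len(result.boxes)} objects detected")
```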

Model serving specifically refers to deploying a model as a network service, usually accessible via an API, designed for handling on-demand, often real-time, prediction requests. It's a specific type of model deployment focused on providing continuous inference capabilities with considerations for scalability and low latency. For many interactive applications requiring immediate predictions, model serving is the preferred deployment method.
