Learn how computer vision tasks like object tracking, instance segmentation, and image classification work and how Ultralytics YOLO11 supports them.
Thanks to cameras and advancements in artificial intelligence (AI), computers and machines are now able to see the world in a way that's similar to how humans do. For example, they can recognize people, track objects, and even understand the context of what’s happening in a video.
Specifically, computer vision is the branch of AI that enables machines to understand and interpret visual information from the world around them. Computer vision involves a variety of tasks, each designed to extract a specific kind of insight from images or videos. For instance, object detection helps identify and locate different items in a picture, while other tasks like tracking, segmentation, and pose estimation help machines understand movement, shapes, and positions more accurately.
The computer vision task used for a particular application depends on the type of insights you need. Models like Ultralytics YOLO11 support a range of computer vision tasks, making them a reliable choice for building real-world Vision AI systems.
In this guide, we’ll take a closer look at the computer vision tasks supported by models like YOLO11. We’ll explore how each task works and how they’re being used across different industries. Let’s get started!
Computer vision tasks aim to replicate human vision abilities in different ways. These tasks can help machines detect objects, track their movements, estimate poses, and even outline individual elements in images and videos. Typically, computer vision tasks are enabled by models that break visual data into smaller parts so that they can interpret what’s happening more clearly.
Vision AI models like Ultralytics YOLO11 support multiple tasks, such as detection, tracking, and segmentation, within a single framework. Thanks to this versatility, YOLO11 is easy to adopt for a wide variety of use cases.
A good example of this is in sports analytics. YOLO11 can be used to detect each player on the field using object detection, then it can follow them throughout the match with object tracking. Meanwhile, YOLO11's pose estimation capabilities can help analyze player movements and techniques, and instance segmentation can separate each player from the background, adding precision to the analysis.
Together, these YOLO11-enabled computer vision tasks create a complete picture of what’s happening during the game, giving teams deeper insights into player performance, tactics, and overall strategy.
Now that we've taken a look at what computer vision tasks are, let's dive into understanding each one supported by YOLO11 in more detail, using real-world examples.
When you look at a photo, most people can easily tell if it shows a dog, a mountain, or a traffic sign because we've all learned what these things typically look like. Image classification helps machines do the same by teaching them to assign a label to an image based on its main subject - whether that's a "car," a "banana," or an "x-ray with fracture." This label helps computer vision systems understand the visual content so they can respond or make decisions accordingly.
One interesting application of this computer vision task is wildlife monitoring. Image classification can be used to identify different animal species from photos captured in the wild. By automatically labeling images, researchers can track populations, monitor migration patterns, and identify endangered species more easily to support conservation efforts.
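To make this concrete, here's a minimal sketch of image classification using the Ultralytics Python package; the model weights download automatically, and the image path is a placeholder for your own file:

```python
from ultralytics import YOLO

# Load a YOLO11 classification model (pre-trained weights are fetched automatically)
model = YOLO("yolo11n-cls.pt")

# Classify an image; replace the path with your own file
results = model("path/to/image.jpg")

# Classification results expose class probabilities via the probs attribute
top_class = results[0].probs.top1
print(results[0].names[top_class], results[0].probs.top1conf.item())
```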
While image classification is helpful for getting an overall idea of what an image contains, it only assigns one label to the entire image. In situations where detailed information, such as the precise location and identity of multiple objects, is required, object detection becomes essential.
Object detection is the process of identifying and locating individual objects within an image, often by drawing bounding boxes around them. Ultralytics YOLO11 performs especially well at real-time object detection, making it ideal for a wide range of applications.
Take, for example, computer vision solutions used in retail stores for stocking shelves. Object detection can help count fruits, vegetables, and other items, ensuring an accurate inventory. In agricultural fields, the same technology can monitor crop maturity to help farmers determine the best time to harvest, even distinguishing between ripe and unripe produce.
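As a rough illustration, the snippet below runs YOLO11 detection on a single image and prints each object's class, confidence, and box coordinates; the image path is a placeholder:

```python
from ultralytics import YOLO

# Load a YOLO11 detection model
model = YOLO("yolo11n.pt")

# Run inference; each result stores bounding boxes, class IDs, and confidences
results = model("path/to/shelf.jpg")

for box in results[0].boxes:
    class_name = results[0].names[int(box.cls)]
    print(class_name, round(box.conf.item(), 2), box.xyxy[0].tolist())
```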
Object detection uses bounding boxes to identify and locate objects in an image, but it doesn’t capture their exact shapes. That’s where instance segmentation comes in. Instead of drawing a box around an object, instance segmentation traces its precise outline.
You can think of it like this: rather than simply indicating that "there's an apple in this area," it carefully outlines and fills in the apple's exact shape. This detailed process helps AI systems clearly understand an object's boundaries, especially when objects are close together.
Instance segmentation can be applied to many applications, from infrastructure inspections to geological surveys. For instance, data from geological surveys can be analyzed using YOLO11 to segment both large and small surface cracks or abnormalities. By drawing precise boundaries around these anomalies, engineers can pinpoint issues and address them before a project begins.
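Here's a minimal sketch of how that looks in code with a YOLO11 segmentation model; the image path is a placeholder:

```python
from ultralytics import YOLO

# Load a YOLO11 instance segmentation model
model = YOLO("yolo11n-seg.pt")

results = model("path/to/surface.jpg")

# Segmentation models return per-object masks alongside bounding boxes
if results[0].masks is not None:
    # masks.xy is a list of polygon outlines, one (num_points, 2) array per object
    for polygon in results[0].masks.xy:
        print(f"Outline with {len(polygon)} points")
```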
So far, the computer vision tasks we've looked at focus on what's in a single image. However, when it comes to videos, we need insights that go beyond one frame. That's where object tracking comes in.
YOLO11's object tracking ability can follow a specific object, like a person or a car, as it moves across a series of video frames. Even if the camera angle changes or other objects appear, the system continues to follow the same target.
This is crucial for applications that require monitoring over time, such as tracking cars in traffic. In fact, YOLO11 can accurately track vehicles, following each car to help estimate their speed in real time. This makes object tracking a key component in systems like traffic monitoring.
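Tracking reuses the same weights as detection; the track() call runs the detector frame by frame and assigns persistent IDs. A minimal sketch, with a placeholder video path:

```python
from ultralytics import YOLO

# Object tracking builds on a detection model; track() adds an ID tracker on top
model = YOLO("yolo11n.pt")

# persist=True keeps track IDs stable across frames; stream=True yields results lazily
for frame_result in model.track(source="path/to/traffic.mp4", persist=True, stream=True):
    if frame_result.boxes.id is not None:
        print(frame_result.boxes.id.tolist())  # one stable ID per tracked vehicle
```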
Objects in the real world aren’t always perfectly aligned - they can be tilted, sideways, or positioned at odd angles. For instance, in satellite images, ships and buildings often appear rotated.
Traditional object detection methods use fixed rectangular boxes that don't adjust to an object's orientation, making it difficult to capture these rotated shapes accurately. Oriented bounding box (OBB) detection solves this problem by using boxes that rotate to fit snugly around an object, aligning with its angle for more precise detection.
In harbor monitoring, for example, YOLO11's support for OBB detection can help accurately identify and track vessels regardless of their orientation, ensuring that every ship entering or leaving the harbor is properly monitored. This precise detection provides real-time information on vessel positions and movements, which is critical for managing busy ports and preventing collisions.
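In code, OBB detection looks much like regular detection, except each result carries a rotation angle. A minimal sketch with a placeholder image path:

```python
from ultralytics import YOLO

# Load a YOLO11 oriented bounding box model (pre-trained on aerial imagery)
model = YOLO("yolo11n-obb.pt")

results = model("path/to/harbor.jpg")

# Each rotated box is stored as (x_center, y_center, width, height, rotation)
for obb in results[0].obb:
    print(results[0].names[int(obb.cls)], obb.xywhr[0].tolist())
```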
Pose estimation is a computer vision technique that tracks key points, such as joints, limbs, or other markers, to understand how an object moves. Rather than treating an entire object or body as one complete unit, this method breaks it down into its key parts. This makes it possible to analyze movements, gestures, and interactions in detail.
One common application of this technology is human pose estimation. By tracking the positions of various body parts in real time, it provides a clear picture of how a person is moving. This information can be used for a variety of purposes, from gesture recognition and activity monitoring to performance analysis in sports.
Similarly, in physical rehabilitation, therapists can use human pose estimation and YOLO11 to monitor patients’ movements during exercises. This helps make sure that each movement is done correctly while tracking progress over time.
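Here's a minimal sketch of running pose estimation with YOLO11; the image path is a placeholder:

```python
from ultralytics import YOLO

# Load a YOLO11 pose estimation model
model = YOLO("yolo11n-pose.pt")

results = model("path/to/exercise.jpg")

# keypoints.xy holds (x, y) coordinates for each detected person's body keypoints
for person in results[0].keypoints.xy:
    print(person.shape)  # (17, 2): the 17 COCO keypoints, e.g. shoulders, knees, ankles
```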
Now that we’ve explored each of these computer vision tasks in detail, let’s walk through how YOLO11 supports them in practice.
YOLO11 isn't just one model - it's a suite of specialized model variants, each designed for a specific computer vision task. This makes YOLO11 a versatile tool that can be adapted to a wide range of applications. You can also fine-tune these models on custom datasets to tackle the unique challenges of your projects.
Here are the YOLO11 model variants pre-trained for specific vision tasks:

- YOLO11 for object detection
- YOLO11-seg for instance segmentation
- YOLO11-pose for pose estimation
- YOLO11-obb for oriented bounding box detection
- YOLO11-cls for image classification
Each variant is available in different sizes, allowing users to choose the right balance between speed and accuracy for their specific needs.
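And as mentioned above, any of these variants can be fine-tuned on a custom dataset. Here's a minimal sketch, assuming your dataset is described by a YAML file (the path is a placeholder):

```python
from ultralytics import YOLO

# Start from pre-trained weights for the task you need
model = YOLO("yolo11n-seg.pt")

# Fine-tune on a custom dataset described by a dataset YAML (placeholder path)
model.train(data="path/to/custom_data.yaml", epochs=100, imgsz=640)

# Validate the tuned model and export it for deployment
metrics = model.val()
model.export(format="onnx")
```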
Computer vision tasks are changing the way machines understand and interact with the world. By breaking down images and videos into key elements, these technologies make it easier to analyze objects, movements, and interactions in detail.
From improving traffic safety and sports performance to streamlining industrial processes, models like YOLO11 can provide real-time insights that drive innovation. As Vision AI continues to evolve, it will likely play an increasingly important role in how we interpret and use visual data every day.
Join our community and visit our GitHub repository to see AI in action. Explore our licensing options and discover more about AI in agriculture and computer vision in manufacturing on our solutions pages.