Discover YOLO12, the latest computer vision model! Learn how its attention-centric architecture and FlashAttention technology enhance object detection tasks across industries.
Computer vision is a branch of artificial intelligence (AI) that helps machines understand images and videos. It is a field that is advancing at an incredible pace because AI researchers and developers are constantly pushing the limits. The AI community is always aiming to make models faster, smarter, and more efficient. One of the latest breakthroughs is YOLO12, the newest addition to the YOLO (You Only Look Once) model series, released on February 18th, 2025.
YOLO12 was developed by researchers from the University at Buffalo, SUNY (State University of New York), and the University of Chinese Academy of Sciences. Taking a new approach, YOLO12 introduces attention mechanisms that let the model focus on the most essential parts of an image rather than processing everything equally.
It also features FlashAttention, a technique that speeds up processing while using less memory, and an area attention mechanism, designed to mimic the way humans naturally focus on central objects.
These improvements make YOLO12n 2.1% more accurate than YOLOv10n and YOLO12m 1.0% more accurate than YOLO11m. However, the gains come with a tradeoff: YOLO12n is about 9% slower than YOLOv10n, and YOLO12m is about 3% slower than YOLO11m.
In this article, we’ll explore what makes YOLO12 different, how it compares to previous versions, and where it can be applied.
The YOLO model series is a collection of computer vision models designed for real-time object detection, meaning they can quickly identify and locate objects in images and videos. Over time, each version has improved in terms of speed, accuracy, and efficiency.
For instance, Ultralytics YOLOv5, released in 2020, became widely used because it was fast and easy to custom-train and deploy. Later, Ultralytics YOLOv8 improved on this by offering additional support for computer vision tasks like instance segmentation and object tracking.
More recently, Ultralytics YOLO11 focused on improving real-time processing while maintaining a balance between speed and accuracy. For example, YOLO11m had 22% fewer parameters than YOLOv8m, yet still delivered better detection performance on the COCO dataset, a widely used benchmark for evaluating object detection models.
Building on these advancements, YOLO12 introduces a shift in how it processes visual information. Rather than treating all parts of an image equally, it prioritizes the most relevant areas, improving detection accuracy. Simply put, YOLO12 builds on previous improvements while aiming to be more precise.
YOLO12 introduces several improvements that enhance computer vision tasks while keeping real-time processing speeds intact. Here's an overview of YOLO12’s key features:
To understand how these features work in real life, consider a shopping mall. YOLO12 can help track shoppers, identify store decorations like potted plants or promotional signs, and spot misplaced or abandoned items.
Its attention-centric architecture helps it focus on the most important details, while FlashAttention ensures it processes everything quickly without overloading the system. This makes it easier for mall operators to improve security, organize store layouts, and enhance the overall shopping experience.
However, YOLO12 also comes with some limitations to consider:
YOLO12 comes in multiple variants, each optimized for different needs. Smaller versions (nano and small) prioritize speed and efficiency, making them ideal for mobile devices and edge computing. The medium and large versions strike a balance between speed and accuracy, while YOLO12x (extra large) is designed for high-precision applications, such as industrial automation, medical imaging, and advanced surveillance systems.
With these variations, YOLO12 delivers different levels of performance depending on the model size. Benchmark tests show that certain variants of YOLO12 outperform YOLOv10 and YOLO11 in accuracy, achieving higher mean average precision (mAP).
However, some models, like YOLO12m, YOLO12l, and YOLO12x, process images slower than YOLO11, showing a trade-off between detection accuracy and speed. Despite this, YOLO12 remains efficient, requiring fewer parameters than many other models, though it still uses more than YOLO11. This makes it a great choice for applications where accuracy is more important than raw speed.
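As a rough illustration, this choice of variant can be expressed as a simple lookup. The mapping below is a sketch, not an official recommendation: the variant names assume the usual Ultralytics n/s/m/l/x naming scheme, and the priorities simply restate the trade-offs described above.

```python
def pick_variant(priority: str) -> str:
    """Suggest a YOLO12 variant for a deployment priority.

    Illustrative sketch: names assume the usual Ultralytics
    n/s/m/l/x scheme described above.
    """
    variants = {
        "speed": "yolo12n",     # nano: mobile and edge devices
        "balanced": "yolo12m",  # medium: speed/accuracy middle ground
        "accuracy": "yolo12x",  # extra large: high-precision workloads
    }
    return variants[priority]


print(pick_variant("speed"))     # -> yolo12n
print(pick_variant("accuracy"))  # -> yolo12x
```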
YOLO12 is supported by the Ultralytics Python package and is easy to use, making it accessible for both beginners and professionals. With just a few lines of code, users can load pre-trained models, run various computer vision tasks on images and videos, and also train YOLO12 on custom datasets. The Ultralytics Python package streamlines the process, eliminating the need for complex setup steps.
For example, here are the steps you would go through to use YOLO12 for object detection:
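The workflow can be sketched with the Ultralytics Python package. This is a minimal sketch under assumptions: the weight file name ("yolo12n.pt"), the image path, and the dataset YAML below are illustrative, and the imports are deferred into the functions so the sketch can be read without the package installed.

```python
def detect_objects(weights: str = "yolo12n.pt", source: str = "mall_camera.jpg"):
    """Load a pre-trained YOLO12 model and run detection on one image."""
    from ultralytics import YOLO  # deferred: requires `pip install ultralytics`

    model = YOLO(weights)    # downloads the checkpoint on first use
    results = model(source)  # list of Results objects (boxes, classes, scores)
    return results


def train_on_custom_data(weights: str = "yolo12n.pt", data: str = "custom_data.yaml"):
    """Fine-tune YOLO12 on a custom dataset described by a YAML file."""
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data, epochs=50, imgsz=640)
```

In practice you would call `detect_objects()` for inference or `train_on_custom_data()` to fine-tune, replacing the placeholder paths with your own files.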
These steps make YOLO12 easy to use for a variety of applications, from surveillance and retail tracking to medical imaging and autonomous vehicles.
YOLO12 can be used in a variety of real-world applications thanks to its support for object detection, instance segmentation, image classification, pose estimation, and oriented object detection (OBB).
However, as we discussed earlier, the YOLO12 models prioritize accuracy over speed, meaning they take slightly longer to process images compared to earlier versions. This tradeoff makes YOLO12 ideal for applications where precision is more important than real-time speed, such as:
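Each of these tasks typically maps to its own checkpoint. The helper below builds checkpoint names following the convention of earlier Ultralytics releases (plain for detection, then -seg, -cls, -pose, -obb suffixes); treat the exact file names as assumptions rather than an official file list.

```python
def task_weights(variant: str, task: str) -> str:
    """Build a checkpoint name for a YOLO12 variant and task.

    Suffixes follow the convention used by earlier Ultralytics
    releases; the exact file names are assumptions.
    """
    suffixes = {
        "detect": "",        # object detection (default)
        "segment": "-seg",   # instance segmentation
        "classify": "-cls",  # image classification
        "pose": "-pose",     # pose estimation
        "obb": "-obb",       # oriented object detection
    }
    return f"yolo12{variant}{suffixes[task]}.pt"


print(task_weights("n", "segment"))  # -> yolo12n-seg.pt
print(task_weights("x", "detect"))   # -> yolo12x.pt
```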
Before running YOLO12, it's important to make sure your system meets the necessary requirements.
Technically, YOLO12 can run on any dedicated GPU (Graphics Processing Unit). By default, it does not require FlashAttention, so it can work on most GPU systems without it. However, enabling FlashAttention can be especially useful when working with large datasets or high-resolution images, as it helps prevent slowdowns, reduce memory usage, and improve processing efficiency.
To use FlashAttention, you’ll need an NVIDIA GPU from one of these series: Turing (T4, Quadro RTX), Ampere (RTX 30 series, A30, A40, A100), Ada Lovelace (RTX 40 series), or Hopper (H100, H200).
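FlashAttention support can be checked from a GPU's CUDA compute capability: the series listed above start at Turing (capability 7.5). The helper below is a sketch of that check; on a live system the (major, minor) pair would come from `torch.cuda.get_device_capability()`.

```python
def supports_flash_attention(major: int, minor: int) -> bool:
    """Return True if a CUDA compute capability can run FlashAttention.

    The supported series above correspond to these capabilities:
    Turing 7.5, Ampere 8.0/8.6, Ada Lovelace 8.9, Hopper 9.0.
    """
    return (major, minor) >= (7, 5)


# On a machine with PyTorch and an NVIDIA GPU you would call, e.g.:
#   import torch
#   supports_flash_attention(*torch.cuda.get_device_capability())
print(supports_flash_attention(8, 6))  # Ampere (RTX 30 series) -> True
print(supports_flash_attention(7, 0))  # Volta (V100) -> False
```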
Keeping usability and accessibility in mind, the Ultralytics Python package does not yet support FlashAttention inference, as its installation can be quite technically complex. To learn more about getting started with YOLO12 and optimizing its performance, check out the official Ultralytics documentation.
As computer vision advances, models are becoming more precise and efficient. YOLO12 improves computer vision tasks like object detection, instance segmentation, and image classification with attention-centric processing and FlashAttention, enhancing accuracy while optimizing memory use.
At the same time, computer vision is more accessible than ever. YOLO12 is easy to use through the Ultralytics Python package and, with its focus on accuracy over speed, is well-suited for applications where precision is key, such as medical imaging, industrial inspections, and robotics.
Curious about AI? Visit our GitHub repository and engage with our community. Explore innovations in sectors like AI in self-driving cars and computer vision in agriculture on our solutions pages. Check out our licensing options and bring your Vision AI projects to life. 🚀
Begin your journey with the future of machine learning