
Getting Hands-On with YOLO-World

Learn about YOLO-World, an innovative object detection model that can identify objects through text prompts. Explore how YOLO-World works and its applications, and get hands-on with a quick code example.

Computer vision projects often involve spending a lot of time annotating data and training object detection models. But that might soon be a thing of the past. Tencent’s AI Lab released YOLO-World, a real-time, open-vocabulary object detection model, on January 31st, 2024. YOLO-World is a zero-shot model, meaning you can run object detection on images for classes of your choice without first training it on them.

Zero-shot models have the potential to change the way we approach computer vision applications. In this blog, we'll explore how YOLO-World works and its potential uses and share a practical code example to get you started.

A Peek into YOLO-World

You can pass an image and text prompt describing what objects you're looking for through the YOLO-World model. For example, if you're interested in finding "a person wearing a red shirt" within a photo, YOLO-World takes this input and gets to work.

The model’s unique architecture combines three main elements:

  • A detector based on the Ultralytics YOLOv8 object detection model, to analyze the visual content of the image
  • A text encoder, taken from OpenAI’s pre-trained CLIP model, that converts your text prompt into embeddings.
  • The Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN), which integrates the processed image data with the text data.

The YOLO detector scans your input image to identify potential objects. The text encoder transforms your description into a format that the model can understand. These two streams of information are then merged through the RepVL-PAN using multi-level cross-modality fusion. This fusion lets YOLO-World detect and locate the objects described in your prompt within the image.

An example of results from YOLO-World.
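
To make the matching idea more concrete, here's a minimal sketch of how image-region embeddings and text embeddings can be compared using cosine similarity. It is a simplification and not YOLO-World's actual implementation: the three-dimensional vectors are made-up numbers, the function names are our own, and the real model learns its embeddings and fuses them inside the RepVL-PAN before scoring.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def region_text_similarity(region_embeddings, text_embeddings):
    """Return a regions x prompts matrix of cosine similarities."""
    regions = [l2_normalize(r) for r in region_embeddings]
    texts = [l2_normalize(t) for t in text_embeddings]
    return [[sum(a * b for a, b in zip(r, t)) for t in texts] for r in regions]


# Toy embeddings (hypothetical values, not real model outputs)
region_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embeddings = {"car": [1.0, 0.0, 0.0], "wheel": [0.0, 0.0, 1.0]}

labels = list(text_embeddings)
scores = region_text_similarity(region_embeddings, list(text_embeddings.values()))

# For each region, pick the prompt with the highest similarity
best_labels = []
for i, row in enumerate(scores):
    best = max(range(len(row)), key=row.__getitem__)
    best_labels.append(labels[best])
    print(f"region {i}: best match = {labels[best]} (similarity {row[best]:.2f})")
```

Each image region is assigned to the prompt whose embedding points in the most similar direction, which is the intuition behind detecting objects from text descriptions.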

Benefits of Choosing YOLO-World

One of the biggest advantages of using YOLO-World is that you don't have to train the model for a specific class. It has already learned from pairs of images and texts, so it knows how to find objects based on descriptions. You can avoid hours of collecting data, annotating data, training on expensive GPUs, and so on.

Here are some other benefits of using YOLO-World:

  • Real-Time Performance - YOLO-World supports real-time performance just like the original YOLO architecture. It’s ideal for applications requiring immediate object detection such as autonomous vehicles and surveillance systems.
  • Instance Segmentation - YOLO-World can neatly outline and separate objects in pictures, even if those objects weren't specifically taught during its training.
  • Efficiency - YOLO-World combines high accuracy with computational efficiency, making it practical for real-world applications. Its streamlined architecture makes rapid object detection possible without excessive demands on processing power.

The Applications of YOLO-World

YOLO-World models can be used for a wide variety of applications. Let’s explore some of them.

Quality Control in Manufacturing

Products manufactured on an assembly line are checked visually for defects before they are packed. This defect detection is often done by hand, which takes time and can lead to mistakes. These mistakes can cause problems like high costs and the need for repairs or recalls. To help with this, special machine vision cameras and AI systems have been created to perform these checks.

YOLO-World models are a big advancement in this area. Thanks to their zero-shot abilities, they can find defects in products even when they haven't been trained for that specific problem. For example, a factory manufacturing water bottles can use YOLO-World to distinguish a bottle that is properly sealed with a cap from one whose cap is missing or faulty.

An example of bottle cap inspection.

Robotics

YOLO-World models allow robots to interact with unfamiliar environments. Without being trained on specific objects that may be in a room, they can still identify what objects are present. So, let’s say a robot enters a room it has never been in before. With a YOLO-World model, it can still recognize and identify objects like chairs, tables, or lamps, even though it hasn't been specifically trained on those items.

In addition to object detection, YOLO-World can also determine the condition of those objects, thanks to its 'prompt-then-detect' feature. For instance, in agricultural robotics, a robot can be prompted to distinguish ripe fruit from unripe fruit.

AI in the Automobile Industry

The automobile industry involves many moving parts, and YOLO-World can be used for different car applications. For example, when it comes to car maintenance, YOLO-World's ability to recognize a wide variety of objects without manual tagging or extensive pre-training is extremely useful. YOLO-World can be used to identify car parts that need to be replaced. It could even automate tasks like quality checks, spotting defects or missing pieces in new cars.

Another application is zero-shot object detection in self-driving cars. YOLO-World can improve an autonomous vehicle’s ability to detect and classify objects on the road, such as pedestrians, traffic signs, and other vehicles, in real time. By doing so, it can help the vehicle detect obstacles and avoid accidents for a safer journey.

An example of detecting objects on a road.

Inventory Management for Retail Stores

Identifying objects on shelves in retail stores is an important part of tracking inventory, maintaining stock, and automating processes. Because Ultralytics YOLO-World can recognize a wide variety of objects without manual tagging or extensive pre-training, it is well suited to inventory management.

For instance, YOLO-World can swiftly spot and categorize items on a shelf, such as different brands of energy drinks. This helps retail stores keep accurate inventory, manage stock levels efficiently, and smooth out supply chain operations.

Each of these applications is different, which shows just how widely YOLO-World can be used. Next, let’s get hands-on with YOLO-World and take a look at a coding example.

A Code Walk Through

As we mentioned before, YOLO-World can be used to detect different parts of a car for maintenance. A computer vision application that detects any repairs needed would involve taking a picture of the car, identifying car parts, examining each part of the car for damage, and recommending repairs. Every part of this system would use different AI techniques and approaches. For the purpose of this code walkthrough, let’s focus on the step where the car parts are detected.

With YOLO-World, you can identify different car parts in an image in under 5 minutes. You can extend this code to try out different applications using YOLO-World as well! To get started, we’ll need to pip install the Ultralytics package as shown below.

# Install ultralytics package
pip install ultralytics

For more instructions and best practices related to the installation process, check our Ultralytics Installation guide. While installing the required packages for YOLOv8, if you encounter any difficulties, take a look at our Common Issues guide for solutions and tips.

Once you’ve installed the needed package, we can download an image from the Internet to run our inferences on. We are going to use the image below.

Our input image.

Then, we’ll import the needed package, initialize our model, and set the classes we are looking for in our input image. Here, we are interested in the following classes: car, wheel, car door, car mirror, and license plate.

# Import YOLOWorld class from ultralytics module
from ultralytics import YOLOWorld

# Initialize the model with pre-trained weights
model = YOLOWorld('yolov8s-world.pt')

# Set the classes you'd like to find in your image
model.set_classes(["car", "wheel", "car door", "car mirror", "license plate"])

We'll then call the predict method with the image's path, the maximum number of detections (max_det), and thresholds for intersection over union (iou) and confidence (conf) to run an inference on the image. Lastly, the image with the detected objects drawn on it is saved to a file named 'result.jpg'.

# Run object detection for your custom classes on your image
results = model.predict('path_to_your_image.jpg', max_det=100, iou=0.01, conf=0.01)

# Save the results
results[0].save(filename='result.jpg')
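
If the iou parameter is new to you, the sketch below shows how intersection over union is calculated for two boxes. The helper name box_iou and the example coordinates are ours and are for illustration only. Ultralytics computes this internally during non-maximum suppression, where a lower iou threshold removes more overlapping boxes.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes don't overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two hypothetical 100 x 100 boxes that overlap by half their width
overlap = box_iou((0, 0, 100, 100), (50, 0, 150, 100))
print(f"IoU: {overlap:.2f}")
```

An IoU of 1.0 means the boxes are identical, and 0.0 means they don't overlap at all.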

The following output image will be saved to your files.

Our output image.
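
Besides saving an image, you may want to check which classes were found. Here's a small sketch that groups confidence scores by class name, using the names, boxes.cls, and boxes.conf attributes of an Ultralytics result. The helper summarize_detections is our own, and the stand-in object with made-up values only lets the snippet run without a model; in practice you would pass results[0] instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_detections(result):
    """Group detection confidences by class name for one result object."""
    summary = defaultdict(list)
    for cls_id, conf in zip(result.boxes.cls, result.boxes.conf):
        summary[result.names[int(cls_id)]].append(round(float(conf), 2))
    return dict(summary)


# Stand-in for results[0] with hypothetical values (not real model output)
fake_result = SimpleNamespace(
    names={0: "car", 1: "wheel", 2: "car door", 3: "car mirror", 4: "license plate"},
    boxes=SimpleNamespace(cls=[0.0, 1.0, 1.0, 4.0], conf=[0.91, 0.42, 0.37, 0.12]),
)

summary = summarize_detections(fake_result)
for name, confidences in summary.items():
    print(f"{name}: {len(confidences)} detection(s), confidences {confidences}")

# Classes you asked for that did not appear in the result
missing = [name for name in fake_result.names.values() if name not in summary]
print("Not detected:", missing)
```

A summary like this makes it easy to spot custom classes that returned no detections at all.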

If you’d prefer to see what YOLO-World can do without coding, you can go to the YOLO-World Demo page, upload an input image, and enter the custom classes. 

Read our docs page on YOLO-World to learn how to save the model with the custom classes so that it can be used directly later without entering custom classes repeatedly.

Did You Notice the Car Doors Weren’t Detected?

If you take a look at the output image again, you’ll notice the custom class “car door” wasn’t detected. Despite its great achievements, YOLO-World has certain limitations. To combat these limitations and use the YOLO-World model effectively, it’s important to use the correct types of textual prompts. 

Here are some tips:

  • YOLO-World may not need high confidence levels for accurate predictions, so reducing confidence thresholds can improve detection rates.
  • Add classes you aren’t interested in. It’ll help improve primary object detection by reducing false positives for secondary objects.
  • Detecting larger objects first, before focusing on smaller details, can improve detection accuracy.
  • Mention colors in your classes to detect objects based on color cues.
  • Describing object sizes in prompts can also help YOLO-World identify specific objects more accurately.
  • Post-processing methods, such as filtering predictions by size or adjusting confidence levels per class, can further improve object detection results.
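
As a rough illustration of the post-processing tip, here's a minimal sketch that filters detections using per-class confidence thresholds, a minimum box area, and a set of classes to ignore. The Detection class, the filter_detections helper, the thresholds, and the sample values are all hypothetical; you would tune them for your own images.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def filter_detections(detections, class_thresholds, default_threshold=0.25,
                      min_area=0.0, ignore=()):
    """Keep detections that pass the class, size, and confidence checks."""
    kept = []
    for det in detections:
        # Drop helper classes that were only added to absorb false positives
        if det.label in ignore:
            continue
        x1, y1, x2, y2 = det.box
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area < min_area:
            continue
        # Use the class-specific threshold when one is defined
        if det.confidence < class_thresholds.get(det.label, default_threshold):
            continue
        kept.append(det)
    return kept


# Made-up detections for illustration
detections = [
    Detection("car", 0.60, (0, 0, 400, 200)),
    Detection("license plate", 0.08, (150, 150, 210, 170)),
    Detection("wheel", 0.03, (50, 150, 100, 200)),
    Detection("tree", 0.70, (500, 0, 600, 300)),
    Detection("wheel", 0.30, (300, 150, 304, 154)),
]

kept = filter_detections(
    detections,
    class_thresholds={"car": 0.30, "license plate": 0.05, "wheel": 0.10},
    min_area=100,
    ignore={"tree"},
)
kept_labels = [det.label for det in kept]
print(kept_labels)
```

Here a small class like "license plate" gets a lower threshold than "car", tiny boxes are discarded, and the extra "tree" class is dropped after it has done its job.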

The Possibilities Are Endless

Overall, YOLO-World models can be a powerful tool thanks to their advanced object detection capabilities. They are efficient and accurate, and they help automate tasks across various applications, like the car-part identification example we walked through.

Feel free to explore our GitHub repository to learn more about our contributions to computer vision and AI. If you're curious about how AI is reshaping sectors like healthcare technology, check out our solutions pages. The possibilities with innovations like YOLO-World seem to be endless!
