Generative AI is changing the road ahead for computer vision

Discover interesting insights from a panel talk at YOLO Vision 2024. Explore how generative AI is shaping the road ahead for real-time Vision AI models.

Generative AI is a branch of artificial intelligence (AI) that creates new content, such as images, text, or audio, by learning patterns from existing data. Thanks to recent advancements, it can now be used to produce highly realistic content that often mimics human creativity.

However, generative AI’s impact goes beyond creating content. As real-time computer vision models such as Ultralytics YOLO continue to evolve, generative AI is also redefining how visual data is processed and augmented, paving the way for innovative applications in real-world scenarios.

This new technological shift was an interesting topic of conversation at YOLO Vision 2024 (YV24), an annual hybrid event hosted by Ultralytics. YV24 saw AI enthusiasts and industry leaders come together to discuss the latest breakthroughs in computer vision. The event focused on innovation, efficiency, and the future of real-time AI solutions.

One of the key highlights of the event was a panel talk on YOLO in the Age of Generative AI. The panel featured Glenn Jocher, Founder & CEO of Ultralytics, Jing Qiu, Senior Machine Learning Engineer at Ultralytics, and Ao Wang from Tsinghua University. They explored how generative AI is influencing computer vision and the challenges of building practical AI models.

In this article, we’ll revisit the key insights from their discussion and take a closer look at how generative AI is transforming Vision AI.

Developing the Ultralytics YOLO models

Alongside Glenn Jocher, many skilled engineers have played a vital role in developing the Ultralytics YOLO models. One of them, Jing Qiu, recounted his unexpected start with YOLO. He explained that his passion for AI began during his college years. He spent a significant amount of time exploring and learning about the field. Jing Qiu recalled how he connected with Glenn Jocher on GitHub and got involved in various AI projects.

Adding on to what Jing Qiu said, Glenn Jocher described GitHub as "an incredible way to share - where people you've never met come together to help each other, contributing to one another's work. It's a great community and a really great way to get started in AI."

Fig 1. Glenn Jocher and Jing Qiu speaking on stage at YV24.

Jing Qiu's interest in AI and his work on Ultralytics YOLOv5 helped refine the model. Later, he played a key role in developing Ultralytics YOLOv8, which introduced further improvements. He described it as an incredible journey. Today, Jing Qiu continues to work on and improve models like Ultralytics YOLO11.

YOLOv10: Optimized for real-world performance

Joining the panel talk remotely from China, Ao Wang introduced himself as a PhD student. Initially, he studied software engineering, but his passion for AI led him to shift toward computer vision and deep learning.

He first encountered the YOLO model while experimenting with various AI techniques and models. He was impressed by its speed and accuracy, which inspired him to dive deeper into computer vision tasks like object detection. More recently, Ao Wang contributed to YOLOv10, where his research focused on making the model faster and more accurate.

The key difference between generative AI and Vision AI

Then, the panel started to discuss generative AI, and Jing Qiu pointed out that generative AI and Vision AI have very different purposes. Generative AI creates or generates things like text, images, and videos, while Vision AI analyzes what already exists, mainly images.

Glenn Jocher highlighted that size is a big difference, too. Generative AI models are massive, often containing billions of parameters - internal settings that help the model learn from data. Computer vision models are much smaller. He said, “The smallest YOLO model we have is about a thousand times smaller than the smallest LLM [Large Language Model]. So, 3 million parameters compared to three billion.”
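To put that comparison in concrete terms, here is a rough back-of-the-envelope sketch in Python. The parameter counts are the round figures quoted in the panel. The assumption of 2 bytes per parameter (16-bit weights) is ours, for illustration only, and real model files vary with precision and format.

```python
# Round figures quoted in the panel: ~3 million vs ~3 billion parameters.
YOLO_PARAMS = 3_000_000
LLM_PARAMS = 3_000_000_000


def weight_size_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in megabytes.

    Assumes 2 bytes per parameter (16-bit weights) unless told otherwise;
    this is an illustrative assumption, not a measured file size.
    """
    return num_params * bytes_per_param / 1_000_000


ratio = LLM_PARAMS // YOLO_PARAMS
print(f"Parameter ratio: {ratio}x")                           # Parameter ratio: 1000x
print(f"YOLO weights: ~{weight_size_mb(YOLO_PARAMS):.0f} MB")  # ~6 MB
print(f"LLM weights:  ~{weight_size_mb(LLM_PARAMS):.0f} MB")   # ~6000 MB
```

Under these assumptions, the smaller model's weights fit in a few megabytes, while the larger one needs several gigabytes before any activations are counted.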

Fig 3. The panel discussion on generative AI and Vision AI at YV24.

Jing Qiu added that generative AI and computer vision training and deployment processes are also very different. Generative AI needs huge, powerful servers to run. Models like YOLO, on the other hand, are built for efficiency and can be trained and deployed on standard hardware. That makes Ultralytics YOLO models more practical for real-world use.

Even though they are different, these two fields are starting to intertwine. Glenn Jocher elaborated that generative AI is bringing new advancements to Vision AI, making models smarter and more efficient.

The impact of generative AI on computer vision

Generative AI has advanced quickly, and these breakthroughs are influencing many other areas of artificial intelligence, including computer vision. Next, let's walk through some fascinating insights from the panel on this.

Hardware advances are enabling AI innovations

Early on in the panel, Glenn Jocher explained that machine-learning ideas have been around for a long time, but computers weren’t powerful enough to make them work. AI ideas needed stronger hardware to make them a reality.

Over the last 20 years, the rise of GPUs (Graphics Processing Units) and their parallel processing capabilities changed everything. They made training AI models much faster and more efficient, which allowed deep learning to develop at a rapid pace.

Nowadays, AI chips like TPUs (Tensor Processing Units) and optimized GPUs use less power while handling larger and more complex models. This has made AI more accessible and useful in real-world applications.

With every new hardware improvement, both generative AI and computer vision applications are becoming more powerful. These advancements are making real-time AI faster, more efficient, and ready for use in more industries.

How generative AI is shaping object detection models

When asked how generative AI is influencing computer vision, Jing Qiu said that transformers - models that help AI focus on the most important parts of an image - have changed the way AI understands and processes images. The first big step was DETR (Detection Transformer), which used this new approach for object detection. It improved accuracy but had performance issues that made it slower in some cases.

To solve this, researchers created hybrid models like RT-DETR. These models combine Convolutional Neural Networks (CNNs, which are deep learning models that automatically learn and extract features from images) and transformers, balancing speed and accuracy. This approach leverages the benefits of transformers while making object detection faster.

Interestingly, YOLOv10 uses transformer-based attention layers (parts of the model that act like a spotlight to highlight the most important areas in an image while ignoring less relevant details) to boost its performance. 
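The "spotlight" idea can be illustrated with a minimal sketch of scaled dot-product attention, the basic operation behind transformer attention layers. This is a generic, pure-Python illustration and not YOLOv10's actual implementation; the function names and the tiny example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query is compared with every key; the resulting weights decide
    how much of each value vector flows into that query's output.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical example: the query matches the first key far better than
# the second, so the output is dominated by the first value vector.
result = attention(
    queries=[[10.0, 0.0]],
    keys=[[10.0, 0.0], [0.0, 10.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(result)
```

In a vision model, the queries, keys, and values would be derived from image features, so regions that match strongly contribute more to the result.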

Ao Wang also mentioned how generative AI is changing the way models are trained. Techniques like masked image modeling help AI learn from images more efficiently, reducing the need for large, manually labeled datasets. This makes computer vision training faster and less resource-intensive.
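As a rough illustration of how masked image modeling works, the sketch below hides a random subset of image patches and scores a reconstruction only on the hidden ones. No labels are involved, which is why the technique reduces the need for annotation. The 14 x 14 patch grid, the 60% mask ratio, and the function names are assumptions chosen for the example, and each patch is reduced to a single number to keep the code short.

```python
import random


def mask_patches(num_patches, mask_ratio=0.6, seed=0):
    """Randomly split patch indices into visible and masked sets.

    mask_ratio is a hypothetical setting; real methods tune it.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


def masked_mse(predicted, target, masked):
    """Mean squared error over the masked patches only.

    Assumes at least one masked patch; visible patches are ignored
    because the model was allowed to see them.
    """
    errors = [(predicted[i] - target[i]) ** 2 for i in masked]
    return sum(errors) / len(errors)


# A 14 x 14 grid of patches (196 in total), with 60% hidden from the model.
visible, masked = mask_patches(14 * 14)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

During training, a model would receive only the visible patches and be asked to predict the masked ones, with a loss like `masked_mse` measuring how close it gets.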

The future of generative AI and Vision AI 

Another key idea the panel discussed was how generative AI and Vision AI might come together to build more capable models. Glenn Jocher explained that while these two approaches have different strengths, combining them could open up new possibilities. 

For instance, Vision AI models like YOLO often break an image into a grid to identify objects. This grid-based method could help language models improve their ability to both pinpoint details and describe them - a challenge many language models face today. In essence, merging these techniques might lead to systems that can accurately detect and clearly explain what they see.
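The grid idea can be sketched in a few lines. The example below maps an object's center point to a cell in a 7 x 7 grid, loosely following the early YOLO formulation; it is a simplification for illustration, since recent YOLO models predict over feature maps at several scales. The function name and image dimensions are hypothetical.

```python
def grid_cell(center_x, center_y, image_width, image_height, grid_size=7):
    """Return the (row, col) of the grid cell containing an object's center.

    Coordinates are in pixels. Points on the right or bottom edge are
    clamped into the last cell so the index stays in range.
    """
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col


# An object centered in a 640 x 480 image lands in the middle cell.
print(grid_cell(320, 240, 640, 480))  # (3, 3)
```

Because each cell has a fixed position in the image, a detection tied to a cell carries location information that a text description alone would lack.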

Fig 4. The future of generative and Vision AI. Image by author.

Key takeaways

Generative AI and computer vision are advancing together. While generative AI creates images and videos, it also improves image and video analysis by introducing ideas that could make Vision AI models more accurate and efficient.

In this insightful YV24 panel talk, Glenn Jocher, Jing Qiu, and Ao Wang shared their thoughts on how these technologies are shaping the future. With better AI hardware, generative AI and Vision AI will continue to evolve, leading to even greater innovations. These two fields are working together to create smarter, faster, and more useful AI for everyday life.

Join our community and explore our GitHub repository to learn more about Vision AI. Check out our licensing options to kickstart your computer vision projects. Interested in innovations like AI in manufacturing or computer vision in self-driving? Visit our solutions pages to discover more. 
