
Get hands-on with Google Gemini 2.5 for computer vision tasks

See how you can get hands-on with Google Gemini 2.5 for computer vision tasks like object detection, image captioning, and OCR for Vision AI solutions.

AI advancements are moving fast, with new innovations making headlines almost every day. One such recent breakthrough is Gemini 2.5, the latest multimodal model from Google DeepMind, launched on March 26th. While traditional Large Language Models (LLMs) can learn from massive amounts of data to generate human-like text, Gemini 2.5 goes beyond that. 

It’s designed as a “thinking model” that can process images, audio, and video, with enhanced reasoning and coding skills. Interestingly, it also performs exceptionally well on computer vision tasks, where machines interpret and analyze visual data, such as object detection, image captioning, and optical character recognition (OCR).

Fig 1. An example of using Gemini 2.5 to understand the contents of an image.

In this article, we'll walk through one of Ultralytics’ notebooks that can help you get hands-on with Gemini 2.5's computer vision capabilities. We’ll also take a closer look at the key features of Gemini 2.5 and showcase how it can be used to build computer vision solutions for real-world applications. Let’s get started!

Overview of Gemini 2.5: features and capabilities

The first release in the Gemini 2.5 model series is an experimental version of Gemini 2.5 Pro. It is designed to handle complex problems by thinking through its responses before giving an answer, using methods like reinforcement learning (where the model learns from feedback) and chain-of-thought prompting (a step-by-step approach to solving problems).

One of its key features is its huge context window, which can hold 1 million tokens (roughly a million words or word parts) and is expected to grow to 2 million. This means the model can take in a lot of information at once, leading to more detailed and accurate results.

On top of processing language, Gemini 2.5 can be used for the following computer vision tasks:

  • Object detection: The process of identifying and locating objects within an image. It can be used in applications such as surveillance or self-driving cars.
  • Image captioning: This task involves generating a descriptive text for an image. It makes visual content more accessible and easier to understand.
  • Optical character recognition: This technology converts text found in images into editable, machine-readable text. It is useful for digitizing documents and automating data entry.

Benchmarking and comparing Google Gemini 2.5 with other models

There are several multimodal models available in the AI space today, so it’s important to understand how Gemini 2.5 Pro compares to them. Based on benchmarking results shared by Google DeepMind, Gemini 2.5 Pro shows impressive performance across a range of tasks. 

For instance, on a test called Humanity’s Last Exam, which simulates a challenging exam covering many subjects and tests advanced reasoning and general knowledge, Gemini 2.5 Pro scores about 18.8%, outperforming models like OpenAI’s o3-mini, which scores around 14%. 

Fig 2. An overview of Gemini 2.5 Pro’s benchmark performance.

It also performs very well on math and coding challenges, often matching or exceeding the performance of models like OpenAI GPT-4.5, Claude 3.7 Sonnet, Grok 3 Beta, and DeepSeek R1, demonstrating its ability to handle complex tasks and process large amounts of data.

Getting hands-on with Gemini 2.5: How to use the Google Gemini API

Gemini 2.5 Pro is available on multiple platforms. You can experiment with it in Google AI Studio and access it through the Gemini app for Gemini Advanced users. In its launch announcement, Google DeepMind also mentioned that the model will be supported on Vertex AI soon. These access points make it easy for developers to use Gemini 2.5 Pro for real-world AI applications. 

However, if you want to get started with the Google Gemini API in just a few minutes, without complicated setup, and gain a better understanding of its computer vision capabilities, you can check out the Ultralytics notebook that showcases tasks like object detection and image captioning using Gemini 2.5 Pro. Let's walk through what you can expect from the notebook in detail.

Setting up inferencing with the Google Gemini 2.5 notebook

To get started with the Ultralytics notebook and use Google Gemini 2.5, you’ll first need to generate an API key through Google AI Studio. This key gives you access to the Gemini API so you can use the model.

Once you have your API key, make sure your environment has the necessary libraries installed; these include packages from Ultralytics and Google’s AI toolkit. This step is clearly outlined in the notebook, so you can easily follow the instructions to set up your workspace.

With everything configured, you can connect to the Gemini API by entering your API key, which creates a link between your workspace and the model. After that, you’ll be ready to send images and text prompts to Gemini 2.5.

Essentially, you can provide an image and a simple instruction (like “detect objects in this image” or “describe what you see”) to the model, and it returns the results you need. This straightforward process makes it easy to start exploring Gemini 2.5's computer vision capabilities.
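As a rough illustration of that flow, here is a minimal sketch of one way to send an image and an instruction to the Gemini API using Google's `google-genai` Python SDK. The model identifier, environment-variable name, and the `ask_gemini` helper and `PROMPTS` dictionary are illustrative assumptions, not code from the notebook; check Google AI Studio for the current model names.

```python
# Minimal sketch: send one image plus a text prompt to the Gemini API.
# Assumes `pip install google-genai` and an API key from Google AI Studio.
import os

# Assumed model identifier for the experimental release; verify in AI Studio.
MODEL_NAME = "gemini-2.5-pro-exp-03-25"


def ask_gemini(image_path: str, prompt: str, model: str = MODEL_NAME) -> str:
    """Send a single image and instruction to Gemini and return its text reply."""
    # Imported lazily so the helper can be defined without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    with open(image_path, "rb") as f:
        image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
    response = client.models.generate_content(model=model, contents=[prompt, image_part])
    return response.text


# Example instructions for the tasks covered in this article.
PROMPTS = {
    "detect": "Detect objects in this image.",
    "caption": "Describe what you see in this image.",
    "ocr": "Extract all readable text from this image.",
}
```

With this helper in place, a call like `ask_gemini("street.jpg", PROMPTS["caption"])` would return the model's text reply for that image.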

Object detection with Google Gemini 2.5

One of the key examples in the notebook is object detection using Gemini 2.5 Pro. In this example, you provide the model with an image and a simple prompt to detect objects. 

The model processes the image and returns a set of coordinates and labels for each object it finds; these coordinates are given in normalized form. Functions from the Ultralytics Python package are then used to convert these normalized values to match the actual dimensions of the image and draw clear bounding boxes around each object, as shown below.

Fig 3. Using Google Gemini 2.5 for object detection.
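To make the coordinate-conversion step concrete, here is a small sketch of how normalized boxes can be mapped back to pixel coordinates. It assumes the common Gemini convention of reporting each box as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale; the `denormalize_box` helper is illustrative, since the notebook handles this with Ultralytics utilities.

```python
def denormalize_box(box, image_width, image_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000 scale,
    assumed) to pixel (x1, y1, x2, y2) coordinates for drawing."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / scale * image_width),   # left edge in pixels
        int(ymin / scale * image_height),  # top edge in pixels
        int(xmax / scale * image_width),   # right edge in pixels
        int(ymax / scale * image_height),  # bottom edge in pixels
    )


# A box covering the central quarter of a 640x480 image:
print(denormalize_box([250, 250, 750, 750], 640, 480))  # (160, 120, 480, 360)
```

The resulting pixel tuple can then be passed to any drawing utility to render the bounding box on the original image.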

Image captioning using Gemini 2.5

Another interesting example in the notebook is image captioning using Gemini 2.5 Pro. In this example, you provide the model with an image and a prompt asking it to generate a detailed caption that describes what’s in the image. 

The model then analyzes the visual content and returns a narrative, often formatted as multiple sentences, that captures both the content and context of the image. This feature is useful for improving accessibility, summarizing visual information, and even enhancing creative storytelling.
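As a sketch of how you might prompt for and sanity-check such a caption, the snippet below pairs an example prompt with a simple post-check on the reply. Both the prompt wording and the `is_detailed_caption` helper are illustrative assumptions, not code from the notebook.

```python
# Illustrative captioning prompt: constrain length and ask for context.
CAPTION_PROMPT = (
    "Generate a detailed caption for this image in two to three sentences, "
    "describing both the main subjects and the overall context."
)


def is_detailed_caption(caption: str, min_words: int = 10) -> bool:
    """Rough check that the reply is a multi-word, sentence-like caption."""
    words = caption.split()
    return len(words) >= min_words and caption.strip().endswith(".")


sample = (
    "A busy city street at dusk, with cars waiting at a crosswalk while "
    "pedestrians hurry past lit storefronts."
)
print(is_detailed_caption(sample))  # True
```

A lightweight check like this can be useful when captions feed into an accessibility pipeline, where a one-word reply would not be descriptive enough.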

Enhancing OCR accuracy with Google Gemini models

OCR is a computer vision task that draws directly on Gemini 2.5 Pro's ability to read text in images. In the notebook, you can provide the model with an image containing text along with a prompt to extract that text. The model processes the image and returns both the detected text and the coordinates where the text is located, as shown below.

Functions from the Ultralytics Python package are then used to convert these normalized coordinates into the actual dimensions of the image and draw bounding boxes around the text regions. This annotated output makes it clear where the text is located, which is useful for digitizing documents, automating data entry, and improving accessibility.

Fig 4. Extracting textual data in an image using Google Gemini 2.5.
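To sketch what handling such a reply can look like, the snippet below parses an OCR response into (text, box) pairs. It assumes the model was prompted to answer in JSON with a `"text"` string and a normalized `[ymin, xmin, ymax, xmax]` `"box"` per detection; the field names, the 0-1000 scale, and the `parse_ocr_reply` helper are all assumptions for illustration.

```python
import json


def parse_ocr_reply(reply: str):
    """Extract (text, box) pairs from a JSON-formatted model reply."""
    # Models often wrap JSON in a markdown fence; strip it if present.
    cleaned = reply.strip().removeprefix("```json").removesuffix("```").strip()
    return [(item["text"], item["box"]) for item in json.loads(cleaned)]


reply = '[{"text": "EXIT", "box": [40, 120, 90, 300]}]'
print(parse_ocr_reply(reply))  # [('EXIT', [40, 120, 90, 300])]
```

Each parsed box could then be converted to pixel coordinates and drawn, just as in the object detection example.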

Real-world applications of Google Gemini 2.5

Now that we've walked through how Google Gemini 2.5 Pro handles various computer vision tasks, let's explore some real-world applications of these capabilities.

Gemini 2.5 Pro’s object detection ability, for instance, can help automatically label and organize large sets of images, making tasks like dataset creation or content management much faster. It can also be used to analyze images in fields like retail and agriculture, such as detecting products on shelves or identifying signs of crop stress in farm photos.

Fig 5. Gemini 2.5 Pro analyzing a plant’s health.

Meanwhile, the model’s image captioning feature can help visually impaired users understand what’s in an image. For example, if you have a photo of a busy street, the model might produce a caption that describes the scene in detail, mentioning the types of vehicles, the activity of pedestrians, and even the time of day based on lighting cues. 

In addition to this, Gemini 2.5’s OCR functionality can be used in a variety of applications. For example, you can digitize printed documents by scanning pages or receipts. This capability is ideal for automating data entry tasks, processing forms, or even reading text from business cards and signage. 

Overall, Google Gemini 2.5 Pro opens the doors to a wide range of practical AI applications.

Key takeaways

Going beyond generating and analyzing text, Google Gemini 2.5 Pro can be used for computer vision tasks like object detection, image captioning, and OCR. With its massive context window and enhanced reasoning capabilities, it produces detailed, context-aware results that work well in real-world scenarios. 

As AI models continue to evolve, tools like Gemini 2.5 Pro are making it easier to solve complex problems across industries. It’s likely that we’ll see even broader adoption of AI as more organizations look for flexible, multimodal solutions that can handle a wide range of tasks, from visual understanding to language processing.

Become a part of our community and learn about cutting-edge AI projects on our GitHub repository. See the applications of Vision AI in agriculture and the role of AI in manufacturing on our solutions pages. Explore our licensing plans and build computer vision solutions today!
