Florence-2: Microsoft's latest vision-language model

Meet Florence-2, Microsoft's visual language model that offers improved object detection, segmentation, and zero-shot performance with great efficiency.

Written by

Abirami Vina

min read

Jul 26, 2024

Apr 4, 2025

What is Florence-2?

Creating the FLD-5B dataset

Understanding Florence-2’s model architecture

Comparing Florence-2 with other VLMs

Applications of Florence-2

Applications of image captioning

Using visual grounding while cooking

Region-based OCR for financial documents

Region-based segmentation in industrial applications

Key takeaways

In June 2024, Microsoft introduced Florence-2, a multi-modal visual language model (VLM) that is designed to handle a wide range of tasks including object detection, segmentation, image captioning, and grounding. Florence-2 sets a new benchmark for zero-shot performance, meaning it can perform tasks without prior specific training, and boosts a smaller model size than other state-of-the-art vision-language models.

It’s more than just another model, Florence-2's versatility and improved performance has the potential to significantly impact various industries by improving accuracy and reducing the need for extensive training. In this article, we will explore the innovative features of Florence-2, compare its performance with other VLMs, and discuss its potential applications.

What is Florence-2?

Florence-2 can handle a variety of tasks within a single unified framework. The model's impressive capabilities are partly thanks to its massive training dataset called FLD-5B. FLD-5B includes 5.4 billion annotations across 126 million images. This comprehensive dataset was created specifically to enable Florence-2 with the capabilities needed to handle a wide range of vision tasks with high accuracy and efficiency.

Here’s a closer look at the tasks Florence-2 supports:

Object Detection: It can identify and locate objects within images with high precision.
‍
Segmentation: This task involves dividing an image into meaningful segments for easier analysis and interpretation.
‍
Image Captioning: Florence-2 is capable of generating descriptive captions for images that provide context and details.
‍
Visual Grounding: The model can associate specific phrases or words in a caption with the corresponding regions in the image.
‍
Zero-shot Performance: It can perform tasks without specific training.

Fig 1. Understanding How Florence-2 Was Trained.

‍

The model supports both text-based and region-based tasks. Special location tokens are added to the model's vocabulary for tasks involving specific regions of an image. These tokens help the model understand different shapes, such as rectangles around objects (box representation), four-sided shapes (quad box representation), and many-sided shapes (polygon representation). The model is trained using a method called cross-entropy loss, which helps it learn by comparing its predictions to the correct answers and adjusting its internal parameters accordingly.

Creating the FLD-5B dataset

The FLD-5B dataset includes different types of annotations: text descriptions, pairs of regions and text, and combinations of text, phrases, and regions. It was created through a two-step process involving data collection and annotation. Images were sourced from popular datasets like ImageNet-22k, Object 365, Open Images, Conceptual Captions, and LAION. The annotations in the FLD-5B dataset are mostly synthetic, meaning they were automatically generated rather than manually labeled.

‍

Initially, specialist models skilled at specific tasks, like object detection or segmentation, created these annotations. Then, a filtration and enhancement process was used to make sure that the annotations were detailed and accurate. After removing any noise, the dataset went through iterative refinement, where Florence-2's outputs were used to continuously update and improve the annotations.

Understanding Florence-2’s model architecture

Florence-2's model architecture follows a sequence-to-sequence learning approach. This means that the model processes an input sequence (like an image with a text prompt) and generates an output sequence (like a description or a label) in a step-by-step manner. In the sequence-to-sequence framework, each task is treated as a translation problem: the model takes an input image and a task-specific prompt and generates the corresponding output.

Fig 3. Florence-2’s Vision-Language Model Architecture.

‍

At the core of the model architecture is a multi-modality encoder-decoder transformer, which combines an image encoder and a multi-modality encoder-decoder. The image encoder, called DaViT (Data-efficient Vision Transformer), processes input images by converting them into visual token embeddings - compact representations of the image that capture both spatial (where things are) and semantic (what things are) information. These visual tokens are then combined with text embeddings (representations of the text), allowing the model to seamlessly merge textual and visual data.

Comparing Florence-2 with other VLMs

Florence-2 stands out from other visual language models due to its impressive zero-shot capabilities. Unlike models like PaliGemma, which rely on extensive fine-tuning to adapt to various tasks, Florence-2 works well right out of the box. Also, Florence-2 is able to compete with larger models like GPT-4V and Flamingo, which often have many more parameters but don't always match Florence-2's performance. For example, Florence-2 achieves better zero-shot results than Kosmos-2, despite Kosmos-2 having over twice the number of parameters.

In benchmark tests, Florence-2 has shown remarkable performance in tasks like COCO captioning and referring expression comprehension. It outperformed models like PolyFormer and UNINEXT in object detection and segmentation tasks on the COCO dataset. It is a highly competitive choice for real-world applications where both performance and resource efficiency are crucial.

Applications of Florence-2

Florence-2 can be used in many different industries, such as entertainment, accessibility, education, etc. Let’s walk through a few examples to get a better understanding.

Applications of image captioning

When you are on a streaming platform trying to decide what to watch, you might read a summary of a movie to help you choose. What if the platform could also provide a detailed description of the movie poster? Florence-2 can make that possible through image captioning, which generates descriptive text for images. Florence-2 can generate detailed descriptions of movie posters, making streaming platforms more inclusive for visually impaired users. By analyzing the visual elements of a poster, such as characters, scenery, and text, Florence-2 can create detailed descriptions that convey the poster's content and mood. The image below shows the level of detail that Florence-2 can provide in its description.

Fig 4. An example of an image caption generated by Florence-2.

‍

Here are some other examples of where image captioning can be helpful:

E-commerce: Image captioning can provide detailed descriptions of product images, helping customers understand product features and details more clearly.
‍
Travel and Tourism: It can provide detailed descriptions of landmarks and attractions in travel guides and apps.
‍
Education: Image captioning can label and describe educational images and diagrams, aiding in teaching and learning.
‍
Real Estate: It can provide detailed descriptions of property images that highlight features and amenities for potential buyers.

Using visual grounding while cooking

Florence-2 can also be used to enrich culinary experiences. For instance, an online cookbook could use Florence-2 to visually ground and label parts of a complex recipe image. Visual grounding helps here by linking specific parts of the image to corresponding descriptive text. Each ingredient and step can be accurately labeled and explained, making it easier for home cooks to follow the recipe and understand each component's role in the dish.

Fig 5. An example of visual grounding using Florence-2.

‍

Region-based OCR for financial documents

OCR with region-based processing, which focuses on extracting text from specific areas within a document, can come in handy when it comes to fields like accounting. Designated areas of financial documents can be analyzed to automatically extract important information such as transaction details, account numbers, and due dates. By reducing the need for manual data entry, it minimizes errors and speeds up processing times. Financial institutions can use it to streamline tasks such as invoice processing, receipt reconciliation, and check clearing, leading to faster transactions and better customer service.

Fig 6. An example of extracting OCR with region using Florence-2.

‍

Region-based segmentation in industrial applications

Region-based segmentation, which involves dividing an image into meaningful parts for focused analysis and detailed inspection, can fuel industrial applications that improve precision and efficiency in various processes. By focusing on specific areas within an image, this technology allows for detailed inspection and analysis of components and products. With respect to quality control, it can identify defects or inconsistencies in materials, such as cracks or misalignments, ensuring that only top-quality products reach the market.

‍

Fig 7. An example of segmentation based on regions using Florence-2.

‍

It also improves automated assembly lines by guiding robotic arms to specific parts and optimizing the placement and assembly of components. Similarly, in inventory management, it helps track and monitor the condition and location of goods, leading to more efficient logistics and reduced downtime. Overall, region-based segmentation boosts accuracy and productivity, leading to cost savings and higher product quality in industrial settings.

Key takeaways

We are starting to see a trend where AI models are becoming lighter while still maintaining high performance. Florence-2 marks a major step forward in terms of visual language models. It can handle various tasks like object detection, segmentation, image captioning, and grounding with impressive zero-shot performance. Despite its smaller size, Florence-2 is efficient and multi-functional, which makes it extremely useful in terms of applications across different industries. Models like Florence-2 are bringing more possibilities to the table, expanding the potential for AI innovations.

Explore more on AI by visiting our GitHub repository and joining our community. Check out our solutions pages to read about AI applications in manufacturing and agriculture. 🚀

Florence-2: Microsoft's latest vision-language model

What is Florence-2?

Creating the FLD-5B dataset

Understanding Florence-2’s model architecture

Comparing Florence-2 with other VLMs

Applications of Florence-2

Applications of image captioning

Using visual grounding while cooking

Region-based OCR for financial documents

Region-based segmentation in industrial applications

Key takeaways

Read more in this category

Let’s build the future
of AI together!

Florence-2: Microsoft's latest vision-language model

What is Florence-2?

Creating the FLD-5B dataset

Understanding Florence-2’s model architecture

Comparing Florence-2 with other VLMs

Applications of Florence-2

Applications of image captioning

Using visual grounding while cooking

Region-based OCR for financial documents

Region-based segmentation in industrial applications

Key takeaways

Read more in this category

Let’s build the future of AI together!

Let’s build the future
of AI together!