
AI Research Updates From Meta FAIR: SAM 2.1 and CoTracker3

Explore Meta FAIR’s latest AI models, SAM 2.1 and CoTracker3, offering advanced segmentation and tracking capabilities for diverse, real-world applications.

Artificial intelligence (AI) is a field of research that has recently been buzzing with excitement and energy, with new innovations and breakthroughs appearing faster than ever before. In the past few weeks, Meta’s Fundamental AI Research (FAIR) team unveiled a set of tools and models aimed at tackling challenges in different areas of AI. These releases include updates that could impact fields as diverse as healthcare, robotics, and augmented reality.

For instance, the updated SAM 2.1 model improves object segmentation, making it easier to accurately identify and separate objects in images and videos. Meanwhile, CoTracker3 focuses on point tracking, helping keep track of points in video frames even when objects move or get partially blocked. 

Meta has also introduced lighter, faster versions of its Llama language model for efficient on-device use, along with new tactile sensing technology for robotics. In this article, we’ll break down these latest releases from Meta FAIR, looking at what each tool offers. Let’s get started!

Meta’s Enhanced Segment Anything Model: SAM 2.1

Object segmentation, a key computer vision task, identifies and separates distinct objects within an image or video, making it easier to analyze specific areas of interest. Since its release, Meta’s Segment Anything Model 2 (SAM 2) has been used for object segmentation across different fields like medical imaging and meteorology. Building on feedback from the community, Meta has now introduced SAM 2.1, an improved version designed to tackle some of the challenges encountered with the original model and deliver stronger performance overall.

Fig 1. SAM 2.1 Model Performance Benchmarking.

SAM 2.1 includes updates to better handle visually similar and smaller objects, thanks to new data augmentation techniques. It also improves how the model deals with occlusion (when parts of an object are hidden from view) by training it on longer video sequences, allowing it to "remember" and recognize objects over time, even if they’re temporarily blocked. For example, if someone is filming a video of a person walking behind a tree, SAM 2.1 can track the person as they reappear on the other side, using its memory of the object’s position and movement to fill in gaps when the view is briefly interrupted.

Alongside these updates, Meta has released the SAM 2 Developer Suite, providing open-source training code and full demo infrastructure so developers can fine-tune SAM 2.1 with their own data and integrate it into a range of applications.
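
If you’d like to try it, here is a minimal sketch of prompting SAM 2.1 with a single point using the open-source sam2 package from the facebookresearch/sam2 repository; the checkpoint ID and image path below are illustrative, so double-check the exact model names in the repository.

```python
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained SAM 2.1 checkpoint (model ID is illustrative; see the repo for exact names).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")

# Read an image; the predictor computes the image embedding once when set_image is called.
image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with a single foreground point (x, y); label 1 marks foreground, 0 background.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )

# Keep the highest-scoring candidate mask for the prompted object.
best_mask = masks[np.argmax(scores)]
```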

CoTracker3: Meta’s Tracking Model and its Features and Updates

Another interesting computer vision task is point tracking. It involves following specific points or features across multiple frames in a video. Consider a video of a cyclist riding along a track: point tracking lets the model follow points on the cyclist, like the helmet or wheels, even if they’re hidden by obstacles for a moment.

Point tracking is essential for applications like 3D reconstruction, robotics, and video editing. Traditional models often rely on complex setups and large synthetic datasets, which limits their effectiveness when applied to real-world scenarios. 

Meta’s CoTracker3 tracking model addresses these limitations by simplifying the model’s architecture. It also introduces a pseudo-labeling technique that lets the model learn from real, unannotated videos, making CoTracker3 more efficient and scalable for practical use.

Fig 2. Comparing CoTracker3 to Other Tracking Models.

One of the features that makes CoTracker3 stand out is that it can handle occlusions well. Using cross-track attention, a technique that allows the model to share information across multiple tracked points, CoTracker3 can infer the positions of hidden points by referencing visible ones. This makes CoTracker3 highly effective in dynamic environments, such as following a person through a crowded scene.

CoTracker3 also offers both online and offline modes. The online mode provides real-time tracking, while the offline mode can be used for more comprehensive tracking across entire video sequences, which is ideal for tasks like video editing or animation.
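
As a rough sketch of the offline mode, the snippet below loads CoTracker3 through torch.hub and tracks a grid of points across a short clip. The hub entry-point name, input layout, and output shapes follow the project’s README at the time of writing, so treat them as assumptions and verify against the repository.

```python
import torch
import imageio.v3 as iio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Read a short clip as (T, H, W, 3) frames, then arrange it as (batch, time, channels, height, width).
frames = iio.imread("clip.mp4", plugin="FFMPEG")  # requires imageio-ffmpeg
video = torch.tensor(frames).permute(0, 3, 1, 2)[None].float().to(device)

# Load the offline CoTracker3 model from torch.hub (entry-point name taken from the README).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

# Track a regular grid of points across the whole sequence.
pred_tracks, pred_visibility = cotracker(video, grid_size=10)
# pred_tracks: (B, T, N, 2) x/y positions per point per frame
# pred_visibility: visibility flags per point per frame (e.g., whether a point is occluded)
```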

Other Updates and Research from Meta FAIR

While SAM 2.1 and CoTracker3 showcase Meta’s latest advancements in computer vision, there are also exciting updates in other areas of AI, such as natural language processing (NLP) and robotics. Let’s take a look at some of these other recent developments from Meta FAIR.

Meta’s Spirit LM: AI Innovations in Language and Multimodal Models

Meta’s Spirit LM is a new multimodal language model that combines text and speech capabilities, making interactions with AI feel more natural. Unlike traditional models that handle only text or only speech, Spirit LM can seamlessly switch between the two. 

Spirit LM can understand and generate language in ways that feel more human-like. For example, it can enhance virtual assistants that can both listen and respond in spoken or written language, or support accessibility tools that convert between speech and text. 

Fig 3. An Example of Text-to-Speech Using Meta Spirit LM.

Moreover, Meta has developed techniques to make large language models more efficient. One of these, called Layer Skip, helps reduce computational needs and energy costs by only activating the layers that are necessary for a given task. This is especially useful for applications on devices with limited memory and power. 
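
The early-exit idea behind Layer Skip can be pictured with a toy loop that runs the decoder layer by layer and stops as soon as an exit head is confident enough. This is only a conceptual sketch, not Meta’s implementation; the layers, exit head, and threshold below are hypothetical stand-ins.

```python
import torch

def early_exit_forward(layers, exit_head, hidden, threshold=0.9):
    """Toy early-exit loop: skip the remaining layers once the exit head is confident.

    A conceptual sketch of the early-exit idea behind Layer Skip, not Meta's code;
    `layers`, `exit_head`, and `threshold` are hypothetical stand-ins.
    """
    probs, depth = None, 0
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        # Project the last position's hidden state to next-token probabilities.
        probs = torch.softmax(exit_head(hidden[:, -1]), dim=-1)
        if probs.max().item() >= threshold:
            return probs, depth  # exit early, saving the compute of the remaining layers
    return probs, depth
```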

Taking on-device deployment a step further, Meta has also rolled out quantized versions of its Llama models. These models are compressed to run faster on mobile devices without sacrificing accuracy.
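
As a generic illustration of how quantization shrinks a model for on-device use (not Meta’s exact recipe, which relies on its own quantization schemes and runtime), PyTorch’s built-in dynamic quantization converts linear-layer weights to 8-bit integers:

```python
import torch
import torch.nn as nn

# A small stand-in model; a real LLM would have many more layers.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Convert Linear weights to int8 after training; activations stay in floating point.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
print(quantized(x).shape)  # same interface as the original model, smaller and faster on CPU
```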

A Look at the Future of Optimization with Meta Lingua

As AI models grow in size and complexity, optimizing their training process has become crucial. With respect to optimization, Meta has introduced Meta Lingua, a flexible and efficient codebase that makes training large language models easier. Meta Lingua’s modular design lets researchers quickly customize and scale their experiments. 

Researchers can spend less time on technical setup and more time on actual research. The codebase is also lightweight and easy to integrate, making it suitable for both small experiments and large-scale projects. By removing these technical hurdles, Meta Lingua helps researchers make faster progress and test new ideas with greater ease.

Fig 4. An Overview of Meta Lingua.

Meta’s Enhancements in AI Security

As quantum computing technology advances, it brings new challenges to data security. Quantum computers are expected to perform certain complex calculations much faster than today’s computers, which means they could potentially break the encryption methods currently used to protect sensitive information. That’s why research in this field is becoming increasingly important: developing new ways to protect data is essential as we prepare for the future of quantum computing.

To address this, Meta has developed Salsa, a tool aimed at strengthening post-quantum cryptographic security. Salsa helps researchers test AI-driven attacks and identify potential weaknesses, enabling them to better understand and address the vulnerabilities in cryptographic systems. By simulating advanced attack scenarios, Salsa provides valuable insights that can guide the development of stronger, more resilient security measures for the quantum era.

AI at Meta: Latest Innovations in Robotics

Meta’s latest work in robotics focuses on helping AI interact more naturally with the physical world by enhancing touch perception, dexterity, and collaboration with humans. In particular, Meta Digit 360 is an advanced tactile sensor that gives robots a refined sense of touch. The sensor helps robots detect details like texture, pressure, and even object shape. With these insights, robots can handle objects with more precision, which is crucial in areas like healthcare and manufacturing.

Here are some of the key features that the Meta Digit 360 includes:

  • It is equipped with 18 distinct sensing features, enabling it to capture a wide range of tactile details.
  • The sensor can detect pressure changes as small as 1 millinewton, enabling robots to respond to fine textures and subtle movements.
  • It includes over 8 million taxels (tiny sensing points) across the fingertip surface, providing a high-resolution map of touch information.

An extension of the Meta Digit 360 is the Meta Digit Plexus, a platform that integrates various touch sensors onto a single robotic hand. This setup allows robots to process touch information from multiple points at once, similar to how human hands gather sensory data.

Fig 5. The Meta Digit Plexus.
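
To give a feel for what this kind of tactile data might look like, the purely hypothetical sketch below treats each fingertip as a small grid of taxel pressure readings and flags contact above the 1 millinewton resolution mentioned earlier; the array shapes and threshold handling are illustrative and not the Digit 360 API.

```python
import numpy as np

# Hypothetical per-fingertip taxel maps: pressure readings in newtons on a coarse grid.
# (The real Digit 360 reports millions of taxels; 64x64 is only for illustration.)
fingertips = {name: np.random.rand(64, 64) * 1e-3 for name in ["thumb", "index", "middle"]}

CONTACT_THRESHOLD_N = 1e-3  # 1 millinewton, the resolution cited above

for name, pressure in fingertips.items():
    contact = pressure >= CONTACT_THRESHOLD_N  # boolean contact mask per taxel
    peak_mn = pressure.max() * 1e3             # strongest reading on this fingertip, in mN
    print(f"{name}: {contact.sum()} taxels in contact, peak {peak_mn:.2f} mN")
```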

Setting the Stage for AI’s Next Chapter

Meta’s latest AI updates, ranging from advances in computer vision with SAM 2.1 and CoTracker3 to new developments in language models and robotics, show how AI is steadily moving from theory into practical, impactful solutions. 

These tools are designed to make AI more adaptable and useful across different fields, helping with everything from segmenting complex images to understanding human language and even working alongside us in physical spaces. 

By prioritizing accessibility and real-world application, Meta FAIR is bringing us closer to a future where AI can tackle real-world challenges and enhance our daily lives in meaningful ways. 

Are you curious about AI? Join our community for the latest updates and insights, and check out our GitHub repository. You can also explore how computer vision can be used in areas like self-driving cars and agriculture!
