See how Meta Movie Gen is redefining video and sound creation. Learn how this model offers precise video editing and supports personalized media creation.
Whether you’re an aspiring filmmaker or a content creator who enjoys making videos for your audience, having AI tools that expand your creativity is always helpful. Recently, Meta launched its latest generative video model, known as Meta Movie Gen.
The global generative AI market in media and entertainment is projected to reach $11.57 billion by 2033, with companies like Runway, OpenAI, and Meta leading the way in groundbreaking innovations. Meta Movie Gen, in particular, is great for applications like filmmaking, video content creation, and digital storytelling, making it easier than ever to bring creative visions to life through high-quality, AI-generated videos. In this article, we’ll explore Meta Movie Gen and how it works. We’ll also take a closer look at some of its applications. Let’s get started!
Before we discuss what Meta Movie Gen is, let’s take a look at how it came to be.
Meta’s research efforts in generative AI started with its Make-A-Scene series of models. This line of research focused on multimodal generative AI that helps artists and visionaries bring their imagination to life, making it possible to create the images, audio, video, and 3D animation they envision. The next leap in innovation came with diffusion models like the Llama Image foundation models (Emu), which made it possible to generate much higher-quality images and videos and also enabled image editing.
Movie Gen is Meta’s latest contribution to generative AI research. It combines all of the previously mentioned modalities and allows further fine-grained control so that people can use the models in more creative ways. Meta Movie Gen is a collection of foundational models for generating different types of media, including text-to-video, text-to-audio, and text-to-image. It consists of four models, which are trained on a combination of licensed and publicly available datasets.
Here’s a quick overview of these models: Movie Gen Video, a 30-billion-parameter model that turns text prompts into high-quality video clips; Movie Gen Audio, a 13-billion-parameter model that generates ambient sound, sound effects, and music synced to a video; Personalized Movie Gen Video, which conditions generation on a reference image of a person; and Movie Gen Edit, which makes precise, instruction-based changes to existing videos.
Several key processes were involved in creating and training the Movie Gen Video model. The first step involved collecting and preparing visual data, including images and video clips, primarily of human activities filtered for quality, motion, and relevance. The data was then paired with text captions that explained what was happening within each scene. The captions, generated using Meta’s LLaMa3-Video model, provided rich details about the content of each scene, enhancing the model’s visual storytelling capabilities.
The training process began with the model learning to transform text into low-resolution images. It then progressed to creating full video clips through a combination of text-to-image and text-to-video training, using increasingly high-quality visuals.
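As a rough mental model of that progressive curriculum, you can think of training as an ordered list of stages that move from text-to-image toward high-quality text-to-video. The sketch below is only illustrative; the stage names follow the progression described above, but the resolutions and step counts are invented placeholders, not Meta’s actual configuration.

```python
# Illustrative progressive training schedule. The stage names reflect the
# text-to-image -> text-to-video progression described above; the specific
# resolutions and step counts are invented placeholders, not Meta's values.
training_stages = [
    {"task": "text_to_image", "resolution": 256, "steps": 200_000},
    {"task": "text_to_image_and_video", "resolution": 256, "steps": 150_000},
    {"task": "text_to_image_and_video", "resolution": 768, "steps": 100_000},
    {"task": "fine_tune_high_quality_video", "resolution": 768, "steps": 20_000},
]

for stage in training_stages:
    print(f"Train {stage['task']} at {stage['resolution']}px for {stage['steps']} steps")
```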
A tool called the Temporal Autoencoder (TAE) compressed the videos so that large volumes of data could be handled efficiently. Fine-tuning then sharpened video quality, and a technique called model averaging, which combines several trained checkpoints into a single model, smoothed out inconsistencies between outputs. Finally, videos generated at 768p were upscaled to a sharp 1080p resolution using a spatial upsampler, which increases resolution by adding pixel detail for clearer visuals. The result was high-quality, detailed video output.
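To make the model-averaging idea concrete, here is a minimal sketch of averaging the weights of several trained checkpoints into one set of weights. This is a generic PyTorch illustration of the technique, not Meta’s actual Movie Gen code; the checkpoint file names and the model they would be loaded into are placeholders.

```python
import torch

def average_checkpoints(checkpoint_paths):
    """Average the weights of several trained checkpoints into one state dict.

    Generic illustration of model averaging, not Meta's actual training code.
    The checkpoint files passed in are placeholders.
    """
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Clone the first checkpoint as the running sum.
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Element-wise mean over all checkpoints.
    n = len(checkpoint_paths)
    return {k: v / n for k, v in avg_state.items()}

# Hypothetical usage: combine three fine-tuned checkpoints, then load the
# averaged weights into a model before the final upsampling stage.
# averaged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(averaged)
```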
The Meta Movie Gen models primarily support four different abilities. Let’s take a closer look at each of them.
Meta Movie Gen can generate high-quality videos. These video clips can be up to 16 seconds long and run at 16 fps (frames per second), creating realistic visuals that capture motion, interactions, and camera angles from text prompts. Paired with the 13-billion-parameter audio model, it can produce synced audio, including ambient sounds, Foley effects, and music, to match the visuals.
This setup ensures a seamless, lifelike experience, where both visuals and audio stay aligned and realistic across various scenes and prompts. For instance, these models were used to create video clips of the viral pygmy hippopotamus from Thailand, Moo Deng.
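Movie Gen has not been released as a public library or API at the time of writing, but a text-to-video plus synced-audio workflow could look roughly like the sketch below. Everything in it, including the `moviegen` module, class names, and parameters, is hypothetical and written only to illustrate the idea.

```python
# Hypothetical sketch only: Movie Gen has no public API at the time of
# writing, so the module, classes, and parameters below are invented
# purely to illustrate the text-to-video + synced-audio workflow.
from moviegen import VideoGenerator, AudioGenerator  # hypothetical imports

prompt = "A pygmy hippo splashing through a shallow pool at golden hour"

video_model = VideoGenerator()          # stand-in for the text-to-video model
clip = video_model.generate(
    prompt=prompt,
    duration_s=16,       # clips up to 16 seconds
    fps=16,              # 16 frames per second -> 256 frames total
    resolution="1080p",  # after spatial upsampling
)

audio_model = AudioGenerator()          # stand-in for the 13B-parameter audio model
soundtrack = audio_model.generate(
    video=clip,
    prompt="gentle water splashes, ambient birdsong",
)

clip.with_audio(soundtrack).save("moo_deng.mp4")
```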
Another interesting capability of the Meta Movie Gen model is personalized video generation. Users can provide a person’s image and a text prompt describing how the video clip should be generated, resulting in a video that includes the reference person and incorporates the rich visual details specified in the text prompt. The model uses both inputs (image and text) to keep the person’s unique appearance and natural body movements, while accurately following the scene described in the prompt.
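Continuing the hypothetical interface sketched above, personalization could be expressed as the same generation call conditioned on an extra reference image. Again, the parameter names here are invented for illustration, not a real API.

```python
# Continuing the hypothetical interface above: personalization conditions the
# generation call on a reference image of the person (invented parameters).
clip = video_model.generate(
    prompt="The same person hiking across red desert rocks at sunset",
    reference_image="portrait.jpg",  # keeps the person's appearance consistent
    duration_s=10,
    fps=16,
)
```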
Using the Movie Gen Edit model, users can provide both a video clip and a text prompt as input to edit the video in creative ways. The model combines video generation with advanced image editing to perform very specific edits, such as adding, removing, or replacing elements. It can also perform global changes like modifying the background of the video clip or the overall style. But what makes the model truly unique is its precision: it can target only the specific pixels that require editing and leave the rest untouched. This preserves the original content as much as possible.
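The "edit only the pixels that need to change" idea can be illustrated with a simple masked blend. This is not how Movie Gen Edit actually works internally (it learns the edit end to end from the video and the instruction); it is just a small, runnable NumPy sketch of how a per-pixel mask preserves the untouched regions of each frame.

```python
import numpy as np

def apply_masked_edit(original_frames, edited_frames, mask):
    """Blend edited frames into the original video only where the mask is set.

    original_frames, edited_frames: arrays of shape (T, H, W, 3)
    mask: array of shape (T, H, W, 1) with values in [0, 1], where 1 marks
          pixels that should take the edited content.

    Simplified illustration of localized editing; the real Movie Gen Edit
    model decides where and how to change pixels end to end.
    """
    mask = mask.astype(np.float32)
    blended = mask * edited_frames + (1.0 - mask) * original_frames
    return blended.astype(original_frames.dtype)

# Toy example: a 16-frame, 64x64 clip where only a small square is "edited".
T, H, W = 16, 64, 64
original = np.zeros((T, H, W, 3), dtype=np.uint8)
edited = np.full((T, H, W, 3), 255, dtype=np.uint8)
mask = np.zeros((T, H, W, 1))
mask[:, 20:40, 20:40, :] = 1.0  # only this region changes

result = apply_masked_edit(original, edited, mask)
```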
Along with the generative AI models, Meta also introduced Movie Gen Bench, a suite of benchmarking tools for testing the performance of generative AI models. It comes with two main tools: Movie Gen Video Bench and Movie Gen Audio Bench. Both are designed to test different aspects of video and audio generation.
Here’s a glimpse of both tools: Movie Gen Video Bench provides a broad set of prompts for evaluating the quality of generated video, while Movie Gen Audio Bench evaluates generated audio, such as sound effects and music, and how well it aligns with the accompanying video.
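In practice, a benchmark like this boils down to running a model over a fixed set of prompts and scoring the outputs. The loop below is a minimal sketch of that idea; the prompt-file format, the `generate_video` function, and the `score_fn` metric are all placeholders you would swap for your own model and evaluation method, not part of Movie Gen Bench itself.

```python
import json

def evaluate_on_prompt_set(prompt_file, generate_video, score_fn):
    """Run a generative model over a benchmark prompt set and collect scores.

    generate_video and score_fn are placeholders for whatever model and
    quality metric you plug in; the prompt file is assumed to be a simple
    JSON list of {"id": ..., "prompt": ...} entries.
    """
    with open(prompt_file) as f:
        prompts = json.load(f)

    results = []
    for entry in prompts:
        video = generate_video(entry["prompt"])   # model under test
        results.append({"id": entry["id"], "score": score_fn(video, entry["prompt"])})

    mean_score = sum(r["score"] for r in results) / len(results)
    return results, mean_score
```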
Now that we’ve covered what the Meta Movie Gen models are and how they work, let’s explore one of their practical applications.
One of the most exciting uses of Meta's Movie Gen is how it can transform filmmaking through AI-powered video and audio creation. With Movie Gen, creators can generate high-quality visuals and sounds from simple text prompts, opening up new ways to tell stories.
In fact, Meta teamed up with Blumhouse and a group of filmmakers, gathering their feedback on how Movie Gen can best support the creative process. Filmmakers like Aneesh Chaganty, the Spurlock Sisters, and Casey Affleck tested the tool's ability to capture mood, tone, and visual direction. They discovered that the models helped spark fresh ideas.
This pilot program has shown that while Movie Gen doesn’t replace traditional filmmaking, it offers directors a new way to experiment with visual and audio elements quickly and creatively. The filmmakers also appreciated how the tool’s editing features let them play with background sounds, effects, and visual styles more freely.
Meta Movie Gen is a step forward in using generative AI to make high-quality videos and sounds from simple text descriptions. The tool helps users easily create realistic and custom videos. With capabilities like precise video editing and personalized media generation, Meta Movie Gen offers a flexible toolset that opens up fresh possibilities for storytelling, filmmaking, and beyond. By making it easier to create detailed and useful visuals, Meta Movie Gen is transforming how videos are made and used across different fields and setting a new standard for AI-driven content creation.
To learn more, visit our GitHub repository, and engage with our community. Explore AI applications in self-driving cars and agriculture on our solutions pages. 🚀
Begin your journey with the future of machine learning.