Inside Emu Video and Emu Edit: Meta AI’s New Milestones in Generative Video
Emu Video is focused on video generation while Emu Edit offers a new image editing method.
Video generation is rapidly becoming one of the next frontiers for generative AI. Video is one of the dominant forms of content on the internet and a foundational block of new trends such as virtual reality. However, generative video poses quite a few challenges compared to domains such as text or audio generation. For starters, video generation models need a strong understanding of the physics of an environment: they must accurately represent objects, textures, and materials, and capture how their interactions evolve over time. Additionally, the datasets available for training video models are considerably smaller than those in other domains. Not surprisingly, generative video has been trailing disciplines such as text, image, or audio generation. But the space is moving quite rapidly.
Recently, Meta AI Research introduced two new models, Emu Video and Emu Edit, that push the boundaries of generative video. Today, I would like to dive into both models.
Emu Video
With Emu Video, Meta AI Research challenges the conventional approach to video generation, where diffusion models are the dominant paradigm and all video frames are generated simultaneously. In stark contrast, LLMs tackle long-sequence generation as an autoregressive problem: each word is predicted based on previously predicted words, so the conditioning signal for each subsequent prediction gradually strengthens. The hypothesis is that reinforcing the conditioning signal is equally crucial for high-quality video generation, given its inherently temporal nature. However, autoregressive decoding with diffusion models poses a significant challenge, since generating a single frame with such models demands multiple iterations.
With Emu Video, Meta AI Research introduces new ideas in text-to-video generation by breaking the generation process into two distinct steps. The first step generates an image based on the provided text. The second step generates a video conditioned on both the text and the previously generated image. The team identified crucial design choices, including adapted noise schedules for diffusion and multi-stage training, that allow them to directly produce high-quality, high-resolution videos without the deep cascade of models seen in earlier approaches.
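Conceptually, the factorized pipeline boils down to two conditional sampling calls. The snippet below is only a minimal sketch of that idea in PyTorch-style Python; the `text_to_image` and `image_text_to_video` callables, their signatures, and the tensor shapes are hypothetical placeholders rather than Meta’s actual code.

```python
import torch

# Hypothetical sketch of Emu Video's factorized generation: the callable names,
# signatures, and shapes below are illustrative assumptions, not Meta's code.

@torch.no_grad()
def generate_video(prompt: str, text_to_image, image_text_to_video,
                   num_frames: int = 16):
    # Step 1: condition on the text prompt alone and sample a single image.
    first_frame = text_to_image(prompt)            # e.g. (3, 512, 512) tensor

    # Step 2: condition on BOTH the text prompt and the generated image,
    # strengthening the conditioning signal for the temporal model.
    video = image_text_to_video(prompt, first_frame, num_frames=num_frames)
    return video                                   # e.g. (num_frames, 3, 512, 512)
```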
The Architecture
In terms of architecture and initialization, Meta AI Research leverages the text-to-image U-Net architecture from previous work and initializes all spatial parameters with a pretrained model. This pretrained model produces 512px square images using an 8-channel 64x64 latent representation, as the autoencoder downscales spatially by a factor of 8. To extract features from the text prompt, the model employs both a frozen T5-XL and a frozen CLIP text encoder, with separate cross-attention layers in the U-Net dedicated to each encoder’s features. The model comprises 2.7 billion frozen spatial parameters and 1.7 billion learned temporal parameters.
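To make this concrete, here is a rough sketch of how the dual text conditioning and the frozen-spatial/trainable-temporal split could look in PyTorch. The module name, layer dimensions, and residual structure are assumptions made for illustration; only the presence of two text encoders with separate cross-attention layers and the frozen spatial versus learned temporal parameters comes from the description above.

```python
import torch
import torch.nn as nn

class FactorizedBlock(nn.Module):
    """Illustrative U-Net block: frozen spatial layers inherited from the
    pretrained text-to-image model plus newly learned temporal layers.
    Dimensions are made up for the sketch."""
    def __init__(self, dim=320, t5_dim=2048, clip_dim=1024):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        # Separate cross-attention layers for each text encoder's features.
        self.cross_attn_t5 = nn.MultiheadAttention(dim, 8, kdim=t5_dim,
                                                   vdim=t5_dim, batch_first=True)
        self.cross_attn_clip = nn.MultiheadAttention(dim, 8, kdim=clip_dim,
                                                     vdim=clip_dim, batch_first=True)
        # Temporal attention is the new, trainable part.
        self.temporal_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

        # Freeze everything inherited from the image model; train only temporal.
        for module in (self.spatial_attn, self.cross_attn_t5, self.cross_attn_clip):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, x, t5_feats, clip_feats):
        # x: (batch * frames, tokens, dim) flattened spatial tokens.
        x = x + self.spatial_attn(x, x, x)[0]
        x = x + self.cross_attn_t5(x, t5_feats, t5_feats)[0]
        x = x + self.cross_attn_clip(x, clip_feats, clip_feats)[0]
        # In the real model, tokens would be reshaped so attention runs across
        # frames; this sketch applies it to the same flattened tokens for brevity.
        x = x + self.temporal_attn(x, x, x)[0]
        return x
```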
To keep the computational cost manageable, the training process unfolds in two stages, summarized below and sketched in the snippet that follows the list.
1. First, for the majority of training iterations (70,000), the team focuses on a simpler task, generating 256px, 8fps, 1-second videos. This reduction in spatial resolution results in a 3.5x reduction in per-iteration time.
2. Then, the model transitions to the desired 512px resolution, training on 4fps, 2-second videos for an additional 15,000 iterations.
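Put together, the schedule looks roughly like the configuration below. Only the numbers (resolutions, frame rates, clip lengths, iteration counts) come from the description above; the field names and the `train_step` hook are assumptions made for this sketch.

```python
# Illustrative two-stage training schedule for Emu Video. The field names are
# assumptions made for this sketch; only the numbers come from the paper.
TRAINING_STAGES = [
    {   # Stage 1: cheaper, low-resolution warm-up (~3.5x faster per iteration).
        "resolution": 256,      # pixels
        "fps": 8,
        "clip_seconds": 1,
        "iterations": 70_000,
    },
    {   # Stage 2: target resolution and clip length.
        "resolution": 512,      # pixels
        "fps": 4,
        "clip_seconds": 2,
        "iterations": 15_000,
    },
]

def run_training(model, stages=TRAINING_STAGES):
    for stage in stages:
        frames = stage["fps"] * stage["clip_seconds"]
        for _ in range(stage["iterations"]):
            # train_step is a hypothetical hook standing in for the actual
            # diffusion training step (noise sampling, loss, optimizer update).
            model.train_step(resolution=stage["resolution"], num_frames=frames)
```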
The Results
Meta AI Research evaluated Emu Video against state-of-the-art video generation models including Make-a-Video (MAV), Imagen-Video (Imagen), Align Your Latents (AYL), Reuse & Diffuse (R&D), Cog Video (Cog), Gen2, and Pika Labs (Pika). The evaluation relied on human raters to select the higher-quality videos, and the results speak for themselves.
The quality of Emu Video’s outputs is quite astonishing:
Emu Edit
The second contribution from Meta AI Research was Emu Edit, a versatile image editing model that redefines the landscape of instruction-based image editing. Emu Edit’s development involves adapting its architecture for multi-task learning, allowing it to excel in a wide range of tasks, including region-based editing, free-form editing, and computer vision tasks like detection and segmentation, all cast as generative tasks.
Emu Edit’s foundation rests on two fundamental contributions. First, the model undergoes multi-task training across sixteen distinct image editing tasks, spanning region-based and free-form editing as well as computer vision tasks, with a dedicated data curation pipeline crafted for each task to ensure a diverse and precise training set. The results show that training a single unified model across all tasks outperforms training separate expert models for each task. Intriguingly, including computer vision tasks such as detection and segmentation noticeably improves editing performance, as validated by both human raters and quantitative metrics.
Secondly, to efficiently handle this array of tasks, Emu Edit introduces the concept of learned task embeddings. These embeddings guide the generation process towards the correct generative task. For each task, a distinct task embedding vector is learned and integrated into the model through cross-attention interactions and timestep embeddings. This innovation significantly enhances the model’s ability to decipher the appropriate edit type from free-form instructions and execute the correct edit.
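One way to picture this mechanism is an embedding table indexed by task ID, whose vector is added to the timestep embedding and appended as an extra token for cross-attention. The module below is a hedged sketch with made-up dimensions and names, not Emu Edit’s actual implementation.

```python
import torch
import torch.nn as nn

NUM_TASKS = 16          # region-based edits, free-form edits, vision tasks, ...
EMBED_DIM = 768         # illustrative; the real dimensionality is an assumption here

class TaskConditioning(nn.Module):
    """Hedged sketch: learn one embedding vector per editing task and inject it
    through (a) the timestep embedding and (b) an extra cross-attention token."""
    def __init__(self, num_tasks=NUM_TASKS, dim=EMBED_DIM):
        super().__init__()
        self.task_table = nn.Embedding(num_tasks, dim)

    def forward(self, task_id, timestep_emb, text_tokens):
        # task_id: (batch,) long tensor, timestep_emb: (batch, dim),
        # text_tokens: (batch, seq, dim) text-encoder features.
        task_emb = self.task_table(task_id)                    # (batch, dim)
        timestep_emb = timestep_emb + task_emb                 # condition the timestep
        # Append the task embedding as an extra token the U-Net can cross-attend to.
        conditioning = torch.cat([text_tokens, task_emb.unsqueeze(1)], dim=1)
        return timestep_emb, conditioning
```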
Armed with a robust model trained across a diverse task spectrum and guided by learned task embeddings, Meta AI Research delves into few-shot adaptation for previously unseen tasks through task inversion. In this process, the model’s weights remain untouched, with only the task embedding updated to align with the new task. Experiments demonstrate Emu Edit’s agility in swiftly adapting to new tasks, including super-resolution and contour detection. Impressively, for certain tasks, fine-tuning the model on a handful of examples nearly matches the performance of an expert model trained on a hundred thousand examples. This makes task inversion with Emu Edit particularly advantageous in scenarios with limited labeled examples or constrained computational resources.
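Task inversion can be sketched as optimizing a single fresh embedding vector while every model weight stays frozen. The `diffusion_loss` hook and the training loop below are illustrative assumptions, not Meta’s code.

```python
import torch

def task_inversion(model, few_shot_examples, dim=768, steps=1000, lr=1e-3):
    """Hedged sketch of few-shot task inversion: only the new task embedding is
    trained; all model weights stay frozen."""
    for p in model.parameters():
        p.requires_grad = False

    # The new task gets a fresh embedding vector, initialized randomly.
    new_task_emb = torch.nn.Parameter(torch.randn(dim) * 0.01)
    optimizer = torch.optim.Adam([new_task_emb], lr=lr)

    for _ in range(steps):
        for inputs, targets in few_shot_examples:
            # `model.diffusion_loss` is a hypothetical hook standing in for the
            # usual denoising loss, conditioned on the candidate task embedding.
            loss = model.diffusion_loss(inputs, targets, task_emb=new_task_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return new_task_emb
```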
Lastly, to foster advancements in instruction-based image editing research, Meta AI Research publicly releases a comprehensive benchmark. This benchmark encompasses seven diverse image editing operations and includes Emu Edit’s generations on this dataset, providing a valuable resource for future endeavors in the field.