Inside Large World Model: UC Berkeley Multimodal Model that can Understand 1 Hour Long Videos

The model can really advance the applicability of foundation models in complex environments.

Jesus Rodriguez
6 min read · Feb 26, 2024
Created Using DALL-E

I recently started an AI-focused educational newsletter that already has over 165,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Building AI models and agents that fully understand complex environments has long been one of the goals of AI. The recent generative AI revolution has expanded the horizons of AI models to understand environments using language, video, and images. Video understanding seems to be the key to unlocking this capability, as videos include features such as object interaction, physics, and other key characteristics of real-world settings. A group of AI researchers from UC Berkeley, including AI legend Pieter Abbeel, published a paper proposing a model that can learn complex representations from images and videos in sequences of up to one million tokens. They named the model the Large World Model (LWM).

The Problem

Today’s language models have difficulty grasping world aspects that are challenging to encapsulate solely through text, especially when it comes to managing intricate, extended tasks. Videos provide a rich source of temporal information that static images and text cannot offer, highlighting the potential benefits of integrating video with language in model training. This integration aims to create models that comprehend both textual knowledge and the physical world, broadening AI’s potential to assist humans. Nevertheless, the ambition to learn from millions of tokens spanning video and language sequences is hampered by significant hurdles such as memory limitations, computational challenges, and the scarcity of comprehensive datasets.

Most current methods for understanding the world through AI are limited to short text sequences or brief image and video clips. This limitation restricts AI’s ability to grasp complex aspects of our world that aren’t easily captured in short formats. Moreover, these models struggle with processing detailed, long-form language and visual tasks. To enhance learning from lengthy video and language sequences, it’s essential to develop a model capable of handling millions of tokens in a single sequence. Yet, the challenge lies in the immense memory requirements, computational demands, and the scarcity of appropriate large-scale datasets.

Large World Model (LWM)

LWM seeks to overcome these obstacles by adopting a holistic approach that combines the rich, dynamic information available in video sequences with textual data. This strategy is focused on fostering a deeper understanding of both human knowledge expressed through language and the nuanced, often complex realities of the physical world. By doing so, LWM aims to enhance AI’s ability to perform a wider array of tasks, offering more substantial assistance to humans across different domains.

LWM addresses these challenges head-on by training a large autoregressive transformer model designed to work with up to a million tokens at a time, building on the foundation of Llama-2 7B. This ambitious goal is achieved through a multi-faceted approach divided into two fundamental steps:

1. Expanding the context window to 1 million tokens through extensive text sources like books.

2. Performing joint training across varied long multimodal sequences, including combinations of text, images, videos, and books.

Image Credit: UC Berkeley

1) Stage I: Developing Long-Context Language Models

The initial stage is dedicated to creating LWM-Text and LWM-Text-Chat, focusing on long-context language models. This is achieved by progressively training on data with increasing sequence lengths, utilizing innovative techniques like RingAttention for efficient processing and modified positional encoding to handle the extended lengths.
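The post does not spell out the positional-encoding change, but a common way to stretch RoPE-based models such as Llama-2 to longer contexts is to enlarge the rotary base frequency theta. The sketch below illustrates that idea only; the theta values and head dimension are placeholders, not LWM's exact settings.

```python
import numpy as np

def rope_frequencies(head_dim: int, theta: float) -> np.ndarray:
    """Per-pair rotation frequencies for rotary position embeddings (RoPE)."""
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

def rotate(x: np.ndarray, position: int, freqs: np.ndarray) -> np.ndarray:
    """Apply the RoPE rotation to a single (head_dim,) query/key vector."""
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Illustrative values only: a larger theta stretches the rotation periods so
# that positions far beyond the original training length stay distinguishable
# as the context grows toward 1M tokens.
base_freqs = rope_frequencies(head_dim=128, theta=10_000)       # Llama-2's default base
long_freqs = rope_frequencies(head_dim=128, theta=10_000_000)   # placeholder enlarged base
```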

Addressing Scalability and Memory Issues

Handling long documents presents a significant challenge due to memory limitations and the computational burden of traditional attention mechanisms. LWM leverages RingAttention, which introduces block-wise computation for scalability, and integrates FlashAttention with Pallas optimization for enhanced performance. This approach ensures efficient use of resources, allowing for theoretically unlimited context sizes depending on available hardware.
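RingAttention itself shards query, key, and value blocks across devices and rotates the key/value blocks around a ring of hosts, which is hard to show in a short snippet. The single-device sketch below only illustrates the block-wise part: exact softmax attention computed one key/value block at a time with running statistics, so memory no longer grows with the square of the sequence length. It is an illustration, not the authors' implementation.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=1024):
    """Exact softmax attention computed one key/value block at a time.

    Per-query memory scales with block_size instead of the full sequence
    length. RingAttention goes further: it shards these blocks across
    devices and rotates the key/value blocks around a ring so that no
    single device ever materializes attention over the whole sequence.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))
    running_max = np.full(n_q, -np.inf)   # running max of scores per query
    running_sum = np.zeros(n_q)           # running softmax normalizer per query

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                    # (n_q, block)
        new_max = np.maximum(running_max, scores.max(axis=1))
        # Rescale everything accumulated so far to the new running max.
        correction = np.exp(running_max - new_max)
        out *= correction[:, None]
        running_sum *= correction
        weights = np.exp(scores - new_max[:, None])
        out += weights @ v_blk
        running_sum += weights.sum(axis=1)
        running_max = new_max

    return out / running_sum[:, None]
```

On small inputs this matches naive full-sequence attention to numerical precision; the point is that only one block of keys and values has to be live at a time.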

Progressive Training for Efficiency

Despite the ability to process long documents, the computational cost remains a concern. LWM employs a progressive training strategy, beginning with shorter sequences of 32K tokens and incrementally increasing to 1 million tokens. This method conserves computational resources by focusing on shorter-range dependencies before tackling longer sequences, enabling the model to learn from a significantly larger volume of tokens than would be possible with direct training on maximum length sequences from the start.
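A rough picture of what such a schedule looks like is sketched below. The sequence lengths mirror the 32K-to-1M progression described above, while the RoPE theta values and the train_stage callable are illustrative placeholders rather than the paper's exact recipe.

```python
# Hypothetical sketch of a progressive context-extension schedule. The
# sequence lengths mirror the 32K-to-1M progression described above; the
# RoPE theta values and the train_stage callable are illustrative
# placeholders, not the paper's exact recipe.
from typing import Callable

CONTEXT_SCHEDULE = [
    {"seq_len": 32_768,    "rope_theta": 1e6},
    {"seq_len": 131_072,   "rope_theta": 1e7},
    {"seq_len": 262_144,   "rope_theta": 2e7},
    {"seq_len": 524_288,   "rope_theta": 2.5e7},
    {"seq_len": 1_048_576, "rope_theta": 5e7},
]

def train_progressively(model, corpus, train_stage: Callable):
    """Each stage resumes from the previous checkpoint, so the model learns
    short-range dependencies cheaply before paying for full 1M-token
    attention in the final stage."""
    for stage in CONTEXT_SCHEDULE:
        model = train_stage(model, corpus,
                            max_seq_len=stage["seq_len"],
                            rope_theta=stage["rope_theta"])
    return model
```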

2) Stage II: Enhancing Vision-Language Model Capabilities

The second phase broadens LWM’s scope by integrating long video and language sequences into the training process. This includes architectural adjustments to accommodate vision input and discussions on training with various sequence lengths. By fine-tuning the language model on vision-language data, LWM significantly improves its understanding of complex, lengthy sequences.

Core Architecture

At its core, LWM operates as an autoregressive transformer capable of processing sequences with millions of tokens. Video frames are tokenized and combined with text tokens for processing, with special delimiters used to distinguish between image and text inputs. This setup allows LWM to train across multiple modalities, from text and images to videos, enhancing its ability to tackle diverse and complex tasks involving extensive language and visual data.

Image Credit: UC Berkeley
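As a rough illustration of how such a sequence might be packed, the sketch below interleaves frame tokens with text between vision delimiters. The delimiter strings, the 256-tokens-per-frame figure, and the tokenizer interfaces are assumptions made for illustration, not LWM's exact implementation.

```python
# Illustrative sketch of how video frames and text could be packed into a
# single token sequence. The delimiter strings, the 256-tokens-per-frame
# figure, and the tokenizer interfaces are assumptions, not LWM's exact code.
from typing import Callable, List, Sequence

VISION_START = "<vision>"   # assumed delimiter marking the start of vision tokens
VISION_END = "</vision>"    # assumed delimiter marking the end of vision tokens

def pack_video_text_sequence(
    frames: Sequence,                              # raw video frames, e.g. numpy arrays
    prompt: str,                                   # the text question or instruction
    encode_text: Callable[[str], List[int]],       # text tokenizer (assumed interface)
    encode_frame: Callable[[object], List[int]],   # VQGAN-style image tokenizer, ~256 codes/frame
) -> List[int]:
    """Interleave tokenized frames with text so one autoregressive
    transformer can attend across both modalities in a single sequence."""
    sequence: List[int] = []
    for frame in frames:
        sequence += encode_text(VISION_START)
        sequence += encode_frame(frame)
        sequence += encode_text(VISION_END)
    sequence += encode_text(prompt)
    return sequence
```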

The Results

LWM displays a remarkably diverse set of capabilities across different tasks.

Let’s look at a few examples:

1) Long Video Understanding

LWM is trained to attend to long sequences of up to 1M tokens, which results in advanced understanding of long videos. Take a look at the following image, which highlights comprehension over a 1-hour-long video:

Image Credit: UC Berkeley
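Some back-of-the-envelope arithmetic shows why roughly 1M tokens is the relevant scale for an hour of footage. The 1-frame-per-second sampling rate and 256 tokens per frame below are illustrative assumptions, not LWM's exact settings.

```python
# Back-of-the-envelope token budget for one hour of video. The 1-frame-per-
# second sampling rate and 256 discrete codes per frame are illustrative
# assumptions.
FPS = 1                         # assumed sampling rate
TOKENS_PER_FRAME = 16 * 16      # assumed 16x16 grid of codes = 256 tokens
SECONDS_PER_HOUR = 60 * 60

frames = SECONDS_PER_HOUR * FPS                 # 3,600 frames
vision_tokens = frames * TOKENS_PER_FRAME       # 921,600 tokens
print(f"{frames:,} frames -> {vision_tokens:,} vision tokens")
# Roughly 922K tokens for the video alone, before adding the question and
# answer text, which is why a ~1M-token context is needed.
```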

2) Text-to-Image Generation

LWM uses autoregressive methods to generate images from text prompts.

Image Credit: UC Berkeley
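Conceptually, generation runs the same next-token loop as language modeling, just over discrete image codes that a separate decoder turns back into pixels. The sketch below is a hypothetical illustration of that loop; the function names and the 256-code image grid are assumptions, not LWM's actual API.

```python
# Hypothetical sketch of autoregressive text-to-image generation: sample
# discrete image codes one at a time conditioned on the prompt, then decode
# them back into pixels. Interfaces and the 256-code grid are assumptions.
from typing import Callable, List

def generate_image(
    encode_text: Callable[[str], List[int]],     # text tokenizer (assumed interface)
    sample_next: Callable[[List[int]], int],     # model's next-token sampler (assumed interface)
    decode_image: Callable[[List[int]], object], # VQGAN-style decoder: codes -> pixels
    prompt: str,
    tokens_per_image: int = 256,                 # assumed 16x16 grid of codes
):
    context = encode_text(prompt) + encode_text("<vision>")
    image_codes: List[int] = []
    for _ in range(tokens_per_image):
        code = sample_next(context)   # condition on prompt + codes generated so far
        image_codes.append(code)
        context.append(code)
    return decode_image(image_codes)
```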

3) Text-to-Video Generation

Similarly, LWM can also generate videos from text prompts autoregressively.

Image Credit: UC Berkeley

4) Image-Based Conversation

LWM is able to reason and answer questions about images.

Image Credit: UC Berkeley

LWM provides a strong foundation for building models that understand the world by combining video, images, and language. LWM uses RingAttention as a mechanism for scaling training to sequences of up to 1M tokens, which drastically improves the ability of these models to interact with long videos. The model represents another important step in multimodal AI. Looking forward to seeing this research expanded into new areas.


Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, President of NeuralFabric and founder of The Sequence, Lecturer at Columbia University, Wharton, Angel Investor...