Inside Yi: The Chinese Multimodal Foundation Models That Have Achieved Remarkable Performance in Image and Language Tasks

Created by AI startup 01, the models are highly competitive with Western alternatives.

Jesus Rodriguez
6 min read · Mar 11, 2024

I recently started an AI-focused educational newsletter that already has over 165,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

The Chinese ecosystem around foundation models has been on fire recently, with releases from Alibaba, DeepSeek, Smaug, and several others. One of the most ambitious foundation model efforts in China comes from 01, the startup founded by former Microsoft and Google researcher Kai-Fu Lee. 01’s first iteration came in the form of the Yi models, a series of multimodal models optimized for both English and Chinese datasets. Last week, 01 published a technical report about the Yi models, and we thought it would be interesting to share some details.

The Yi series models stand out for their bilingual capabilities. These models are founded on a massive, 3.1 trillion-token multilingual dataset, positioning them among the top-performing large language models globally. Yi made headlines when the Yi-34B-Chat variant clinched the second spot, right after GPT-4 Turbo, surpassing competitors like GPT-4, Mixtral, and Claude on the AlpacaEval Leaderboard, according to records up to January 2024. Furthermore, the Yi-34B model ranked highest among all accessible open-source models, including Falcon-180B, Llama-70B, and Claude, in both English and Chinese across different benchmarks such as the Hugging Face Open LLM Leaderboard and C-Eval, with data up to November 2023.

These results become easier to understand once we look at Yi’s architecture as well as its pretraining and finetuning workflows.


The Architecture

The foundation of Yi is built on a refined version of the well-known decoder-only Transformer structure, with adaptations from LLaMA’s framework. Noteworthy adjustments include:

· Attention Mechanism: Unlike LLaMA, which employs Grouped-Query Attention only in its largest model, Yi integrates this efficient attention mechanism across its 6B and 34B models. This strategy, which organizes query-heads into groups sharing the same key and value head, significantly cuts down on both training and inference expenses without compromising performance, even in its smaller 6B model.

· Activation Function: Yi adopts SwiGLU for its post-attention activation function, adjusting the activation size for consistency and to offset the parameter reduction from employing Grouped-Query Attention. This ensures the model’s parameter count remains competitive with existing models of similar sizes.

· Positional Embedding and Handling Long Contexts: Yi utilizes Rotary Position Embedding to follow standard practices while adjusting its base frequency to cater to longer context windows of up to 200K. This enhancement allows the model to be trained initially on 4K context lengths and then further refined on a dataset with longer sequences, ensuring its adaptability and performance across various applications.
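The grouping idea behind Grouped-Query Attention can be sketched in a few lines of NumPy. The head counts and dimensions below are illustrative, not Yi’s actual configuration; the point is simply that several query heads share one key/value head, shrinking the KV cache:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Scaled dot-product attention where query heads are grouped to
    share key/value heads, reducing KV-cache size and bandwidth.

    q: (n_q_heads, seq, d)   k, v: (n_kv_heads, seq, d)
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads           # query heads per KV head
    # Repeat each KV head so it is shared by its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)    # (n_q_heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # only 2 KV heads: 4 queries per group
v = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

With 8 query heads but only 2 KV heads, the KV cache stored during inference is a quarter of the multi-head-attention size, which is where the cost savings come from.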

Extended Capabilities

Yi expands the traditional baseline LLM set of capabilities in three fundamental areas:

1) Long Context Modeling

The Yi models enhance their understanding of extended texts through a two-phase approach that is both efficient and effective. Initially, the models undergo a phase of continual pretraining designed to tap into their inherent ability to process information from a context of up to 200K tokens. This phase is pivotal for unlocking the models’ capabilities, as shown by their impressive results in finding specific details within large volumes of text. Following this, a finetuning phase customizes the models’ responses, tailoring them to align with human preferences and instructions.
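The base-frequency adjustment to Rotary Position Embedding mentioned earlier is what makes this long-context training work. A rough NumPy sketch of the idea follows; the dimension and base values are illustrative, not Yi’s actual hyperparameters. Raising the base slows the per-pair rotation frequencies, so positions hundreds of thousands of tokens apart remain distinguishable:

```python
import numpy as np

def rope_frequencies(dim, base):
    """Per-pair rotation frequencies used by Rotary Position Embedding."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rotate(x, pos, base):
    """Apply RoPE to one head vector x at position pos (dim must be even)."""
    freqs = rope_frequencies(len(x), base)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin   # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Raising the base (here from the common 10,000 to an illustrative
# 5,000,000) lowers the slowest frequency, stretching the range of
# positions the embedding can tell apart.
f_short = rope_frequencies(128, base=10_000)
f_long = rope_frequencies(128, base=5_000_000)
print(f_long[-1] < f_short[-1])  # True
```

Because RoPE only rotates pairs of coordinates, it preserves vector norms, which is why the base can be changed between the 4K pretraining stage and the long-sequence continual-pretraining stage without destabilizing the model’s representations.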

2) Vision-Language Integration

In the evolving field of multimodal research, Yi introduces its Vision Language models, Yi-VL-6B and Yi-VL-34B, expanding its linguistic capabilities to include image understanding. These models build on the foundational Yi-6B-Chat and Yi-34B-Chat, incorporating a Vision Transformer for image encoding, a Projection Module to bridge image and text representations, and the robust Yi-Chat models known for their bilingual proficiency. The development of Yi-VL models benefits significantly from a diverse collection of bilingual image-text pairs, bolstering their performance in understanding and generating content across languages and modalities.

3) Depth Upscaling

The pursuit of model improvement through scaling has consistently shown that larger computational resources, model sizes, and data volumes lead to better performance. However, optimally allocating these resources remains a complex challenge. Drawing on recent research, Yi adopts a novel strategy that dynamically adjusts the investment between model depth and data volume. Through staged training processes, this approach fine-tunes the synergy between the scale of data and model complexity, guided by established scaling laws. This method not only optimizes training efficiency but also enhances the overall performance of the models.
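To make the trade-off concrete, here is a toy compute-allocation exercise using a Chinchilla-style parametric scaling law. The constants and candidate model sizes are illustrative, not Yi’s fitted values, and this is not Yi’s actual depth-upscaling procedure; it only shows the kind of calculation that scaling laws enable:

```python
def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric scaling law: predicted loss as a
    function of model size and training tokens (constants illustrative)."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Fix a compute budget C ~ 6 * N * D (FLOPs), sweep candidate model
# sizes, spend the remaining budget on tokens, and pick the allocation
# with the lowest predicted loss.
C = 6 * 34e9 * 3.1e12  # compute of a 34B model trained on 3.1T tokens
candidates = [7e9, 13e9, 34e9, 70e9]
best_loss, best_n = min((loss(n, C / (6 * n)), n) for n in candidates)
print(f"best model size under this budget: {best_n:.0e} params")
```

The same logic extends to staged training: at each stage, the predicted-loss surface tells you whether the next unit of compute is better spent on a deeper model or on more data.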

Pretraining Process

Yi’s pretraining involves a comprehensive approach where a standard dense transformer architecture is trained on an extensively curated, large dataset. The guiding principle here is straightforward: with sufficiently high-quality data, a conventional architecture can achieve remarkable performance without significant alterations, despite Yi’s team experimenting with various architectural changes.

The process to ensure data quality involves a carefully constructed data-processing pipeline, aimed at creating a rich bilingual pretraining dataset. Starting with web documents sourced from Common Crawl, Yi utilizes advanced language identification and quality assessment techniques to filter and refine the data. The resulting dataset boasts 3.1 trillion tokens of high-quality content in both English and Chinese, distinguishing itself by its bilingual nature and superior quality, a step above other known data mixtures.
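A heavily simplified sketch of such a pipeline follows. The heuristics below are toy stand-ins for the learned language-identification and quality-assessment models a production pipeline like Yi’s would use; only the overall shape (language filter, deduplication, quality gate) reflects the description above:

```python
import hashlib

def quality_filter(doc):
    """Toy quality heuristics standing in for learned classifiers."""
    text = doc["text"]
    if len(text) < 200:  # too short to be useful training data
        return False
    # Reject documents dominated by a single repeated character.
    if max(text.count(c) for c in set(text)) / len(text) > 0.2:
        return False
    return True

def dedup_and_filter(docs, languages=("en", "zh")):
    """Keep target-language documents, dropping exact duplicates
    and low-quality text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if doc["lang"] in languages and digest not in seen and quality_filter(doc):
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    {"lang": "en", "text": "A long, informative English document. " * 20},
    {"lang": "en", "text": "A long, informative English document. " * 20},  # duplicate
    {"lang": "fr", "text": "Un document en français. " * 20},               # wrong language
    {"lang": "zh", "text": "!!!!!" * 100},                                  # low quality
]
print(len(dedup_and_filter(corpus)))  # 1
```

Real pipelines replace the toy heuristics with model-based scorers and near-duplicate detection (e.g. MinHash), but the control flow is much the same.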

Finetuning Strategy

When it comes to finetuning, Yi prioritizes the quality of data far above its quantity. Contrary to methods that rely on vast amounts of data, Yi’s approach is more meticulous, focusing on a smaller set of data that is carefully examined and refined. The finetuning dataset includes fewer than 10,000 dialog pairs, each crafted and improved through multiple revisions and user feedback. This method has proven to yield better results than using larger, open-source datasets, according to preliminary tests conducted by the team.
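One hypothetical way to represent such a curated dataset is to track each example’s revision history and gate training on it. The schema and helper below are purely illustrative, not 01’s actual tooling:

```python
# Hypothetical schema: each dialog pair records how many editing
# passes it has been through, so curation quality is auditable.
dialog = {
    "prompt": "Explain grouped-query attention in one paragraph.",
    "response": "Grouped-query attention shares each key/value head "
                "across a group of query heads, shrinking the KV cache...",
    "revisions": 3,            # editing passes completed so far
    "category": "reasoning",   # QA, writing, dialogue, math, coding, safety...
}

def ready_for_training(example, min_revisions=2):
    """Quality gate: only examples reviewed enough times are included."""
    return example["revisions"] >= min_revisions and len(example["response"]) > 0

dataset = [dialog]
train_set = [ex for ex in dataset if ready_for_training(ex)]
print(len(train_set))  # 1
```

The point of the gate is the quality-over-quantity philosophy described above: a sub-10K set where every example has survived multiple review passes, rather than a large set ingested wholesale.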

Yi’s dataset encompasses a broad range of prompts, ensuring the model’s versatility across different tasks such as question answering, creative writing, dialogue, reasoning, mathematics, coding, and safety, along with bilingual capabilities. This wide-ranging focus underscores Yi’s commitment to delivering a finely-tuned model capable of handling a diverse array of applications.

The Results

The Yi-34B-Chat model demonstrates exceptional performance, ranking first among all existing open-source models on benchmarks including MMLU, CMMLU, BBH, GSM8k, and more.

Yi is one of the most interesting open-source foundation models available today.



Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, President of NeuralFabric, and founder of The Sequence. Lecturer at Columbia University and Wharton. Angel investor...