Inside Data2vec 2.0: Meta AI's New Self-Supervised Model for Vision, Speech and Text

The new model delivers major performance improvements over its predecessor.

Jesus Rodriguez
3 min read · Jan 5


Image Credit: Meta AI

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:

Early last year, Meta AI unveiled Data2vec, one of the first self-supervised learning models to master tasks across different domains such as speech, text and vision. The model was one of the first iterations in Meta AI's self-supervised architectures that emulate human learning processes using different sensory inputs. A few weeks ago, Meta AI followed up with Data2vec 2.0, a new version of the model that delivers up to a 16x improvement in training speed.

The original Data2vec architecture is based on a student network and a teacher network. The teacher network computes latent representations for a piece of text, an image, or a speech sample from the full input. The student network receives a masked version of the same input and attempts to predict the teacher's latent representations. The two neural networks are nearly identical.

Image Credit: Meta AI
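
To make the flow concrete, here is a minimal PyTorch-style sketch of a data2vec-like student/teacher loop. The module names, sizes, masking scheme and EMA decay are my own simplifications for illustration, not Meta AI's released implementation:

```python
# Illustrative data2vec-style student/teacher loop (simplified assumptions).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in for the (nearly identical) student and teacher transformers."""
    def __init__(self, dim=64, layers=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x):
        return self.blocks(x)

student = Encoder()
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad = False               # teacher is never trained by backprop

def training_step(batch, mask, optimizer, ema_decay=0.999):
    # Teacher sees the full, unmasked input and produces target representations.
    with torch.no_grad():
        targets = teacher(batch)

    # Student sees a masked version of the same input and predicts those targets.
    masked = batch.clone()
    masked[mask] = 0.0                    # crude masking, for illustration only
    preds = student(masked)

    # Regress the student's outputs onto the teacher's targets at masked positions.
    loss = F.mse_loss(preds[mask], targets[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher tracks the student as an exponential moving average of its weights.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss.item()

# Example usage with random data: batch of 2 sequences, 16 positions, dim 64.
x = torch.randn(2, 16, 64)
mask = torch.rand(2, 16) < 0.5
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(training_step(x, mask, opt))
```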

The magic of Data2vec is its ability to predict contextualized representations that take the entire input into account. For example, the target representation for a given word is based on the entire sentence the word appears in, which leads to more efficient learning.
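
One common way to build such targets, described in the Data2vec paper, is to average the teacher's top-K transformer blocks. The sketch below assumes the teacher exposes its per-layer hidden states; the layer count and normalization are illustrative choices, not the exact released configuration:

```python
# Illustrative construction of contextualized targets from teacher layer outputs.
import torch
import torch.nn.functional as F

def contextual_targets(layer_outputs, top_k=3):
    """layer_outputs: list of [batch, seq_len, dim] tensors, one per teacher block.

    Because each block applies self-attention over the full sequence, the target
    for a single position already encodes information from the entire input
    (e.g., a word's target depends on the whole sentence it appears in).
    """
    stacked = torch.stack(layer_outputs[-top_k:])        # [top_k, B, T, D]
    targets = stacked.mean(dim=0)                        # average the top-K blocks
    return F.layer_norm(targets, targets.shape[-1:])     # normalize per position

# Example: 6 fake "layer outputs" for a batch of 2 sequences of 16 tokens, dim 64.
layers = [torch.randn(2, 16, 64) for _ in range(6)]
print(contextual_targets(layers).shape)   # torch.Size([2, 16, 64])
```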

Data2vec 2.0

Data2vec 2.0 improves on its predecessor in several ways:

1) The model takes the target representations for a given training example and reuses them for masked versions that hide different random parts of that example, as sketched below.
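
Here is a minimal sketch of that target-reuse idea, assuming a student/teacher setup like the one in the earlier snippet. The function name, number of masks and masking probability are illustrative assumptions; the point is that the teacher's expensive forward pass runs once per example and its targets are shared across several differently masked copies:

```python
# Illustrative multi-mask training step that reuses teacher targets (assumed setup).
import torch
import torch.nn.functional as F

def multi_mask_loss(student, teacher, batch, num_masks=8, mask_prob=0.6):
    # 1) Compute target representations once for the unmasked training example.
    with torch.no_grad():
        targets = teacher(batch)                        # [B, T, D]

    # 2) Reuse those targets for several masked versions that each hide
    #    different random parts of the same training example.
    total = 0.0
    for _ in range(num_masks):
        mask = torch.rand(batch.shape[:2]) < mask_prob  # new random mask each time
        masked = batch.clone()
        masked[mask] = 0.0
        preds = student(masked)
        total = total + F.mse_loss(preds[mask], targets[mask])
    return total / num_masks
```

Amortizing the teacher's computation over many masked views is a large part of where the training-speed gains come from.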


Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, I write The Sequence Newsletter, Guest lecturer at Columbia University and Wharton, Angel Investor, Author, Speaker.