OpenAI’s New Super Model: Whisper Achieves Human-Level Performance in Speech Recognition

The new model combines zero-shot learning with supervised fine-tuning to match human-level performance across different speech recognition tasks.

Jesus Rodriguez
3 min read · Sep 27, 2022


Image Source: https://huggingface.co/spaces/openai/whisper

I recently started an AI-focused educational newsletter that already has over 125,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Automatic speech recognition (ASR) is one of the deep learning disciplines that has seen a tremendous level of innovation in the last few years. Part of the catalyst for this innovation has been the emergence of unsupervised pretraining techniques, which can learn from raw audio without an explicit dependency on human labelers. Wav2Vec is the canonical example of this type of unsupervised method. Despite that progress, ASR systems have yet to achieve human-level performance across most benchmarks. AI powerhouse OpenAI recently set its sights on this challenge. The result is Whisper, an ASR model that shows human levels of accuracy and robustness, outperforming both supervised and unsupervised models in the space.

The key insight of Whisper was to combine unsupervised pretrained models with high-quality labeled datasets. While the audio encoder architecture prevalent in pretrained models excels at learning audio representations, it lacks an equally powerful decoder that can map those representations to outputs matching human-level performance. To address this challenge, Whisper uses a decoder trained on 680,000 hours of labeled multilingual and multitask data, which can be used to fine-tune the pretrained model.
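To make the idea of supervised fine-tuning on labeled data concrete, here is a minimal, purely illustrative sketch of a single cross-entropy training step on one labeled transcript token. Every name and dimension here is made up for illustration; this is not Whisper’s actual training code or model size.

```python
import numpy as np

# Illustrative only: one cross-entropy training step on a labeled
# (audio, transcript) pair -- the kind of supervised signal labeled
# multilingual/multitask data provides. All sizes are hypothetical.
rng = np.random.default_rng(1)
vocab, d_model = 32, 16

W = rng.standard_normal((d_model, vocab)) * 0.1  # decoder output projection
h = rng.standard_normal((1, d_model))            # decoder hidden state
target = 7                                       # labeled transcript token id

def softmax(z):
    z = z - z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax((h @ W).ravel())
loss = -np.log(probs[target])                    # cross-entropy on the label

# One gradient-descent step on the projection (softmax + NLL chain rule).
grad_logits = probs.copy()
grad_logits[target] -= 1.0
W -= 0.01 * np.outer(h.ravel(), grad_logits)

new_loss = -np.log(softmax((h @ W).ravel())[target])
```

After the step, `new_loss` is lower than `loss`: the labeled token nudges the decoder toward the human transcript, which is the mechanism the paragraph above describes at scale.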

As you probably guessed, Whisper’s architecture is based on an encoder-decoder transformer model. The encoder layers consist…
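The encoder-decoder flow can be sketched at the shape level. The sketch below is a toy illustration under made-up dimensions and random weights; Whisper’s real models use far larger log-Mel spectrogram inputs and multi-layer attention, not a single linear map.

```python
import numpy as np

# Hypothetical, scaled-down dimensions for illustration only.
n_mels, n_frames = 8, 10      # log-Mel spectrogram input (rows x frames)
d_model, vocab = 16, 32       # encoder/decoder width, token vocabulary

rng = np.random.default_rng(0)

# Encoder: map each spectrogram frame to a d_model-dim representation.
W_enc = rng.standard_normal((n_mels, d_model))
audio = rng.standard_normal((n_frames, n_mels))   # fake audio features
enc_states = np.tanh(audio @ W_enc)               # (n_frames, d_model)

# Decoder step: attend over encoder states, then emit token logits.
query = rng.standard_normal((1, d_model))         # one decoding position
attn = np.exp(query @ enc_states.T)
attn /= attn.sum()                                # softmax attention weights
context = attn @ enc_states                       # (1, d_model)
W_out = rng.standard_normal((d_model, vocab))
logits = context @ W_out                          # (1, vocab)
next_token = int(logits.argmax())                 # greedy next-token choice
```

The point is the data flow: audio features go through the encoder once, while the decoder repeatedly attends over the encoder states to emit transcript tokens one at a time.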


Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, I write The Sequence Newsletter, Guest lecturer at Columbia University and Wharton, Angel Investor, Author, Speaker.