Models · OpenAI · Sep 2022

15. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)

Scale applied to speech recognition

Research Paper

Summary

Introduces Whisper, a speech recognition model trained on 680,000 hours of weakly supervised web audio. It approaches human-level accuracy on many benchmarks, supports 97 languages, and became a widely adopted open-source model for speech-to-text.

Key Concepts

680K hours of weakly supervised web audio — scale over curation

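The training set is 680,000 hours of internet audio paired with its existing text (web videos, podcasts, etc.). The supervision is weak: the transcripts are taken from the web as found, rather than produced by a curated, human-verified transcription effort.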

Encoder-decoder transformer: audio → spectrogram → text

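Whisper is an encoder-decoder transformer. Audio is converted to a log-mel spectrogram, the encoder processes the spectrogram, and the decoder generates the text: audio → log-mel spectrogram → encoder → decoder → text.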
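To make the audio-to-encoder step concrete, here is a rough sketch of the shape arithmetic for one input chunk. The constants are the ones described in the paper (16 kHz audio, 30-second chunks, a 10 ms frame stride, 80 mel channels, and a stride-2 convolution in front of the encoder). The function and dictionary key names are illustrative and do not come from any library.

```python
# Shape bookkeeping for Whisper's audio front end (illustrative names only).
SAMPLE_RATE = 16_000   # Hz; audio is resampled to 16 kHz
CHUNK_SECONDS = 30     # audio is processed in 30-second chunks
HOP_MS = 10            # stride between spectrogram frames, in milliseconds
N_MELS = 80            # mel channels in the models described in the paper
CONV_STRIDE = 2        # the second convolution before the encoder halves the length


def frontend_shapes(chunk_seconds: int = CHUNK_SECONDS) -> dict:
    """Return the sizes of the intermediate tensors for one audio chunk."""
    n_samples = SAMPLE_RATE * chunk_seconds          # raw waveform length
    hop_samples = SAMPLE_RATE * HOP_MS // 1000       # samples per frame step
    n_frames = n_samples // hop_samples              # spectrogram time steps
    n_positions = n_frames // CONV_STRIDE            # encoder sequence length
    return {
        "n_samples": n_samples,
        "spectrogram_shape": (N_MELS, n_frames),
        "encoder_positions": n_positions,
    }


if __name__ == "__main__":
    for name, value in frontend_shapes().items():
        print(f"{name}: {value}")
```

Under these assumptions, a 30-second chunk becomes an 80 × 3000 spectrogram, which the encoder sees as 1,500 positions.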

Single model handles transcription, translation, and language ID across 97 languages

Whisper handles transcription, translation (to English), and language identification in a single model.
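One way to picture how a single model covers several tasks: the decoder is conditioned on a short prefix of special tokens that names the language and the task. The sketch below builds that prefix as plain strings. The token spellings follow the released Whisper tokenizer, but `build_decoder_prefix` is a hypothetical helper written for illustration, not part of any package.

```python
from typing import List

TASKS = ("transcribe", "translate")  # translate means "into English"


def build_decoder_prefix(language: str, task: str = "transcribe",
                         timestamps: bool = False) -> List[str]:
    """Build the special-token prefix that selects language and task.

    `language` is a short language code such as "en" or "fr".
    """
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder for plain text without timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix


if __name__ == "__main__":
    # French audio, transcribed in French.
    print(build_decoder_prefix("fr", "transcribe"))
    # French audio, translated to English text.
    print(build_decoder_prefix("fr", "translate"))
```

Language identification fits the same scheme: instead of supplying the language token, the model predicts it as the first token after `<|startoftranscript|>`.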

Near-human accuracy on many benchmarks with strong multilingual support

Approached human-level accuracy on many benchmarks, with strong multilingual support across 97 languages.
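Accuracy on speech recognition benchmarks is usually reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the reference length. The following is a minimal sketch of that metric, assuming whitespace tokenization and no text normalization; published evaluations typically normalize text first, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Lower is better: 0.0 means the hypothesis matches the reference word for word, and values above 1.0 are possible when the hypothesis contains many extra words.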

Connections

Influenced by

8. Language Models are Few-Shot Learners (GPT-3)

May 2020

Influences

26. Hello GPT-4o

May 2024