Scale applied to speech recognition
Research Paper
Released Whisper, a speech recognition model trained on 680,000 hours of weakly supervised web audio, achieving near-human accuracy across multiple languages and becoming the open-source standard for speech-to-text.
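680,000 hours of audio from the internet paired with corresponding text (web videos, podcasts, etc.). This is weak supervision: the transcripts are whatever accompanied the audio online, not transcriptions produced and verified specifically for training.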
An encoder-decoder transformer. Audio → log-mel spectrogram → encoder → decoder → text.
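The first stage of that pipeline, audio to log-mel spectrogram, can be sketched in NumPy. This is a minimal illustration, not Whisper's reference code. The constants (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel channels) follow the paper's description. The helper names, the HTK-style mel formula, and the `1e-10` clamp are assumptions made for this example, and the normalization used by the real front end is omitted.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # 25 ms window at 16 kHz
HOP = 160             # 10 ms stride at 16 kHz
N_MELS = 80           # number of mel channels


def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_points = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    )
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(audio):
    """Map a 1-D float waveform to an (N_MELS, n_frames) log-mel array."""
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(N_FFT)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, 201)
    mel = power @ mel_filterbank().T                  # (n_frames, N_MELS)
    # Clamp before the log so silent or empty bands stay finite.
    return np.log10(np.maximum(mel, 1e-10)).T


# One second of a 440 Hz tone as stand-in audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

The resulting array, one column per 10 ms frame, is the kind of input the encoder consumes; the decoder then generates text tokens conditioned on the encoder's output.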
Whisper handles transcription, translation (to English), and language identification in a single model.
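One way a single model can serve all three tasks is by conditioning the decoder on special tokens that name the language and the task. The sketch below shows that idea using the token strings from Whisper's multitask format. The helper `build_decoder_prompt` is hypothetical and written only for illustration; it builds a list of token strings and does not call any real library.

```python
def build_decoder_prompt(language, task="transcribe", timestamps=False):
    """Return the special-token prefix that tells the decoder what to do.

    Hypothetical helper: `language` is a language code such as "en" or
    "fr", or None to leave the language for the model to predict.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    prompt = ["<|startoftranscript|>"]
    if language is None:
        # Language identification: the model's next predicted token
        # is the language tag itself.
        return prompt

    prompt.append(f"<|{language}|>")
    prompt.append(f"<|{task}|>")
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


# French audio transcribed in French.
print(build_decoder_prompt("fr", task="transcribe"))
# French audio translated to English text.
print(build_decoder_prompt("fr", task="translate"))
# Language left open for the model to identify.
print(build_decoder_prompt(None))
```

Because the task is selected by the prefix rather than by separate model heads, the same weights and the same decoding loop cover transcription, translation, and language identification.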
Approached human-level accuracy on many benchmarks, with strong multilingual support across 97 languages.