© 2026 Silvia Seceleanu

Models · OpenAI · Sep 2022

15. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)

Scale applied to speech recognition

Research Paper
Summary

OpenAI released Whisper, a speech recognition model trained on 680,000 hours of weakly supervised web audio. It approaches human accuracy on many benchmarks across multiple languages and has become a widely used open-source baseline for speech-to-text.

Key Concepts

680K hours of weakly supervised web audio — scale over curation

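The training set is 680,000 hours of internet audio paired with its existing text (web videos, podcasts, etc.). The supervision is weak because the paired text is collected from the web rather than transcribed and verified specifically for the dataset.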

Encoder-decoder transformer: audio → spectrogram → text

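Whisper is an encoder-decoder transformer. Audio is converted to a log-mel spectrogram, the encoder processes the spectrogram, and the decoder generates the text: audio → log-mel spectrogram → encoder → decoder → text.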
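To make the audio → spectrogram step concrete, the sketch below works out the size of the spectrogram that reaches the encoder. The constants (16 kHz audio, 30-second chunks, a 10 ms hop, 80 mel channels, and a stride-2 convolution in front of the encoder) are taken from the Whisper paper's description of the original models, not from this page, and the helper names are hypothetical.

```python
# Assumed front-end constants from the Whisper paper (original model sizes).
SAMPLE_RATE = 16_000   # audio is resampled to 16 kHz
CHUNK_SECONDS = 30     # audio is processed in 30-second chunks
HOP_LENGTH = 160       # 10 ms stride between spectrogram frames
N_MELS = 80            # mel filterbank channels


def spectrogram_shape(seconds=CHUNK_SECONDS):
    """Return (mel channels, time frames) for a chunk of the given length."""
    n_samples = seconds * SAMPLE_RATE
    n_frames = n_samples // HOP_LENGTH
    return (N_MELS, n_frames)


def encoder_positions(n_frames):
    """A stride-2 convolution halves the time axis before the transformer blocks."""
    return n_frames // 2


shape = spectrogram_shape()
print(shape)                        # (80, 3000)
print(encoder_positions(shape[1]))  # 1500
```

Under these assumptions, each 30-second chunk becomes an 80 × 3000 spectrogram, and the encoder attends over 1500 time positions.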

Single model handles transcription, translation, and language ID across 97 languages

Whisper handles transcription, translation (to English), and language identification in a single model.
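One way to picture how a single model covers all three tasks: the Whisper paper describes conditioning the decoder on a short sequence of special tokens that name the language and the task. The sketch below only assembles that token sequence as strings. The function name build_prompt is hypothetical and this is not the library's API; the token spellings follow the paper's multitask format.

```python
def build_prompt(language, task, timestamps=False):
    """Assemble the special-token prefix that tells the decoder what to do.

    Hypothetical helper for illustration; `language` is a code such as "en".
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


# French speech, English text out (translation always targets English).
print(build_prompt("fr", "translate"))
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']
```

Language identification fits the same scheme: instead of being supplied, the language token is the first thing the decoder predicts after the start-of-transcript token.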

Near-human accuracy on many benchmarks with strong multilingual support

Whisper approached human-level accuracy on many benchmarks, with strong multilingual support across 97 languages.

Connections

Influenced by
8. Language Models are Few-Shot Learners (GPT-3)
May 2020
Influences
26. Hello GPT-4o
May 2024