The paper that started the GPT paradigm
Research paper that introduced the Generative Pre-trained Transformer (GPT), demonstrating that unsupervised pre-training on a large text corpus followed by supervised fine-tuning produces strong performance across diverse NLP tasks.
A 12-layer transformer decoder with 117 million parameters, trained on BookCorpus (~7,000 unpublished books).
The key insight: next-token prediction as a pre-training objective captures deep linguistic structure that transfers across tasks.
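The next-token objective can be illustrated with a minimal sketch. This is not the paper's implementation, just a hypothetical NumPy illustration of the idea: given per-position logits over a vocabulary, the loss is the average negative log-likelihood of each *following* token.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Average negative log-likelihood of predicting token t+1 from position t.

    logits: (seq_len, vocab_size) array of unnormalized scores
    tokens: (seq_len,) array of integer token ids
    """
    # Position t predicts token t+1, so drop the last logit row
    # and the first token.
    logits, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax over the vocabulary.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: vocabulary of 5 tokens, sequence of 4 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
tokens = np.array([1, 3, 0, 2])
loss = next_token_loss(logits, tokens)
```

Minimizing this loss over a large corpus is what forces the model to internalize syntax, semantics, and world knowledge, since all of them help predict the next token.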
GPT-1 achieved state-of-the-art results on 9 of the 12 NLP benchmarks studied, with absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI). It did this with a single architecture that required only task-specific output layers.
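The "task-specific output layers" point can be sketched as follows. This is an assumption-laden toy, not the paper's code: a shared backbone representation is reused across tasks, and each task contributes only a small linear head (the weight names below are hypothetical).

```python
import numpy as np

def linear_head(features, W, b):
    """Task-specific linear projection on top of shared backbone features."""
    return features @ W + b

hidden = 8
# Stand-in for the transformer's final hidden state for one input.
features = np.ones(hidden)

rng = np.random.default_rng(1)
# Two illustrative tasks sharing the same backbone:
# a 3-way entailment head and a 2-way sentiment head.
W_entail, b_entail = rng.normal(size=(hidden, 3)), np.zeros(3)
W_sent, b_sent = rng.normal(size=(hidden, 2)), np.zeros(2)

entail_logits = linear_head(features, W_entail, b_entail)  # shape (3,)
sent_logits = linear_head(features, W_sent, b_sent)        # shape (2,)
```

During fine-tuning, the backbone weights and the single new head are trained together on the target task, so almost all learned parameters transfer unchanged between tasks.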