
© 2026 Silvia Seceleanu

OpenAI · Jan 2021

10. Zero-Shot Text-to-Image Generation (DALL-E)

When language models learned to see and create

Research Paper
Summary

Demonstrated that a 12B parameter autoregressive transformer trained on text-image pairs could generate coherent images from natural language descriptions, opening the era of text-to-image AI.

Key Concepts

12B autoregressive transformer generating images as sequences of discrete visual tokens

A 12B parameter autoregressive transformer that models text and image tokens as a single sequence, consuming text tokens and then generating image tokens one at a time. Images are encoded into 32×32 grids of discrete tokens, drawn from an 8192-entry codebook, using a discrete variational autoencoder (dVAE).
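The sequence layout described above can be sketched as follows. The figures (up to 256 text tokens, a 32×32 = 1024-token image grid, an 8192-entry dVAE codebook) come from the paper; the sampling function is a stand-in for the trained transformer, included only to illustrate the autoregressive loop.

```python
# Sketch of DALL-E's joint text-image token sequence (paper figures):
# up to 256 BPE text tokens followed by 32*32 = 1024 image tokens,
# each indexing an 8192-entry dVAE codebook.
import random

TEXT_LEN = 256          # max text tokens (paper)
GRID = 32               # dVAE latent grid is 32x32
IMAGE_LEN = GRID * GRID # 1024 image tokens
IMAGE_VOCAB = 8192      # dVAE codebook size (paper)

def generate_image_tokens(text_tokens, sample_next=None):
    """Autoregressively extend a text prefix with image tokens.

    `sample_next` stands in for the trained transformer's sampling
    step; the uniform default is purely illustrative.
    """
    if sample_next is None:
        sample_next = lambda seq: random.randrange(IMAGE_VOCAB)
    prefix = list(text_tokens)[:TEXT_LEN]
    seq = list(prefix)
    for _ in range(IMAGE_LEN):
        # Each new image token is conditioned on everything before it.
        seq.append(sample_next(seq))
    return seq[len(prefix):]  # the 1024 image tokens

tokens = generate_image_tokens([1, 2, 3])
```

In the real model, the 1024 sampled codebook indices are handed to the dVAE decoder, which maps the 32×32 latent grid back to a 256×256 RGB image.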

Generate images from any text prompt, ranked by CLIP for fidelity

Given any text prompt, DALL-E generates multiple candidate images. A separate CLIP model ranks the candidates for fidelity to the prompt.
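The reranking step above reduces to: generate several candidates, score each against the prompt with an image-text similarity model, and keep the highest-scoring ones. A minimal sketch, where `clip_score` is a hypothetical stand-in for a real CLIP similarity function:

```python
# Hedged sketch of CLIP reranking. `clip_score` stands in for a real
# CLIP model's image-text similarity; here we fake it with word
# overlap against hypothetical candidate captions, just to show
# the selection logic.
def rerank(prompt, candidates, clip_score):
    """Return candidates sorted best-first by similarity to the prompt."""
    return sorted(candidates,
                  key=lambda img: clip_score(prompt, img),
                  reverse=True)

def toy_score(prompt, caption):
    # Illustrative scorer: count shared words between prompt and caption.
    return len(set(prompt.split()) & set(caption.split()))

best = rerank("an avocado armchair",
              ["a green chair", "an avocado armchair photo", "a dog"],
              toy_score)[0]
# best == "an avocado armchair photo"
```

In DALL-E's pipeline the model samples many images per prompt (512 in the paper's best setting) and CLIP's joint image-text embedding supplies the score, so the reranker surfaces the samples most faithful to the text.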

Combines never-before-seen concepts ("avocado armchair") — compositional understanding

DALL-E could combine concepts it had never seen together — "an armchair in the shape of an avocado" — demonstrating compositional understanding.

Connections

Influenced by
8. Language Models are Few-Shot Learners (GPT-3)
May 2020
Influences
14. DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents
Apr 2022