When language models learned to see and create
Research Paper
Demonstrated that a 12B-parameter autoregressive transformer trained on text–image pairs could generate coherent images from natural-language descriptions, opening the era of text-to-image AI.
A 12B-parameter autoregressive transformer that consumes a sequence of text tokens and generates image tokens one at a time. Images are first encoded into 32×32 grids of discrete tokens using a discrete variational autoencoder (dVAE), so that image generation becomes next-token prediction over the combined text-and-image sequence.
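The dVAE step above can be illustrated with a toy vector-quantization sketch: each image patch is replaced by the index of its nearest codebook vector, turning the image into a small grid of discrete tokens. This is illustrative only, not the real DALL-E dVAE; the codebook, patch vectors, and 2×2 grid size are made up for the example (the actual model uses a learned 8192-entry codebook and 32×32 grids).

```python
import math

# Hypothetical 3-entry codebook; DALL-E's real codebook is learned and much larger.
CODEBOOK = [
    (0.0, 0.0, 0.0),   # token 0: dark
    (0.5, 0.5, 0.5),   # token 1: mid
    (1.0, 1.0, 1.0),   # token 2: bright
]

def nearest_token(patch):
    """Return the index of the codebook vector closest to this patch (L2 distance)."""
    dists = [math.dist(patch, code) for code in CODEBOOK]
    return dists.index(min(dists))

def tokenize(image):
    """Map a grid of patch vectors to a grid of discrete token ids."""
    return [[nearest_token(patch) for patch in row] for row in image]

# A toy 2x2 "image" of RGB-like patch vectors.
image = [
    [(0.1, 0.0, 0.1), (0.9, 1.0, 0.8)],
    [(0.4, 0.6, 0.5), (0.0, 0.1, 0.0)],
]
tokens = tokenize(image)
print(tokens)  # → [[0, 2], [1, 0]]
```

The transformer never sees pixels, only these token ids, which is what lets a language-model architecture generate images.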
Given a text prompt, DALL-E samples multiple candidate images; a separate CLIP model then ranks the candidates by fidelity to the prompt, and the best-scoring ones are kept.
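The reranking step can be sketched as cosine similarity between one text embedding and several candidate image embeddings, keeping the best-scoring candidates. This is a minimal illustration of the idea, not CLIP's actual API; the embedding vectors below are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rerank(text_emb, image_embs, top_k=2):
    """Return candidate indices sorted by similarity to the text, best first."""
    order = sorted(range(len(image_embs)),
                   key=lambda i: cosine(text_emb, image_embs[i]),
                   reverse=True)
    return order[:top_k]

# Hypothetical embeddings for a prompt and three generated candidates.
text = (1.0, 0.0, 0.5)
candidates = [
    (0.0, 1.0, 0.0),   # off-prompt
    (0.9, 0.1, 0.4),   # close match
    (0.5, 0.5, 0.5),   # partial match
]
print(rerank(text, candidates))  # → [1, 2]
```

In the real system both encoders are learned jointly so that matching text–image pairs land close together in the shared embedding space; the ranking logic itself is just this similarity sort.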
DALL-E could combine concepts it had never seen together — "an armchair in the shape of an avocado" — demonstrating compositional understanding.