When language models learned to see and create
Research Paper
Demonstrated that a 12B-parameter autoregressive transformer trained on text–image pairs could generate coherent images from natural-language descriptions, opening the era of text-to-image AI.
A 12B-parameter autoregressive transformer that consumes a sequence of text tokens and generates image tokens one at a time. Images are first encoded into 32×32 grids of discrete tokens using a discrete variational autoencoder (dVAE), so that image generation becomes next-token prediction over the combined text-and-image sequence.
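The dVAE step above can be illustrated with a toy vector-quantization sketch: each image patch is replaced by the index of its nearest codebook vector, turning the image into a small grid of discrete tokens. This is illustrative only, not the real DALL-E dVAE; the codebook, patch vectors, and 2×2 grid size are made up for the example (the actual model uses a learned 8192-entry codebook and 32×32 grids).

```python
import math

# Hypothetical 3-entry codebook; DALL-E's real codebook is learned and much larger.
CODEBOOK = [
    (0.0, 0.0, 0.0),   # token 0: dark
    (0.5, 0.5, 0.5),   # token 1: mid
    (1.0, 1.0, 1.0),   # token 2: bright
]

def nearest_token(patch):
    """Return the index of the codebook vector closest to this patch (L2 distance)."""
    dists = [math.dist(patch, code) for code in CODEBOOK]
    return dists.index(min(dists))

def tokenize(image):
    """Map a grid of patch vectors to a grid of discrete token ids."""
    return [[nearest_token(patch) for patch in row] for row in image]

# A toy 2x2 "image" of RGB-like patch vectors.
image = [
    [(0.1, 0.0, 0.1), (0.9, 1.0, 0.8)],
    [(0.4, 0.6, 0.5), (0.0, 0.1, 0.0)],
]
tokens = tokenize(image)
print(tokens)  # → [[0, 2], [1, 0]]
```

The transformer never sees pixels, only these token ids, which is what lets a language-model architecture generate images.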
Given a text prompt, DALL-E samples multiple candidate images; a separate CLIP model then ranks the candidates by fidelity to the prompt, and the best-scoring ones are kept.
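The reranking step can be sketched as cosine similarity between one text embedding and several candidate image embeddings, keeping the best-scoring candidates. This is a minimal illustration of the idea, not CLIP's actual API; the embedding vectors below are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rerank(text_emb, image_embs, top_k=2):
    """Return candidate indices sorted by similarity to the text, best first."""
    order = sorted(range(len(image_embs)),
                   key=lambda i: cosine(text_emb, image_embs[i]),
                   reverse=True)
    return order[:top_k]

# Hypothetical embeddings for a prompt and three generated candidates.
text = (1.0, 0.0, 0.5)
candidates = [
    (0.0, 1.0, 0.0),   # off-prompt
    (0.9, 0.1, 0.4),   # close match
    (0.5, 0.5, 0.5),   # partial match
]
print(rerank(text, candidates))  # → [1, 2]
```

In the real system both encoders are learned jointly so that matching text–image pairs land close together in the shared embedding space; the ranking logic itself is just this similarity sort.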
DALL-E could combine concepts it had never seen together — "an armchair in the shape of an avocado" — demonstrating compositional understanding.