OpenAI · Apr 2022

14. DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents

Photorealistic text-to-image generation

Research Paper
Summary

DALL-E 2 combined CLIP embeddings with diffusion models to generate photorealistic 1024×1024 images from text, dramatically improving on DALL-E's quality and becoming the catalyst for the generative-image explosion.

Key Concepts

A prior converts the text's CLIP embedding into a CLIP image embedding; a diffusion decoder then generates the image from that embedding
1024×1024 output resolution, a 4× per-side leap over DALL-E's 256×256
Working in CLIP's latent space keeps generations semantically coherent with the prompt
Image variations, inpainting, and text-guided editing unlock creative workflows
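The two-stage pipeline above can be sketched in a few lines. This is a hedged toy illustration, not the paper's implementation: the encoder, prior, and decoder are stand-in functions with random weights, and the names and embedding width (512) are hypothetical; only the stage ordering and the 1024×1024 output size come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 512    # CLIP embedding width (illustrative, not from the paper)
RES = 1024   # output resolution reported for DALL-E 2

def clip_text_encode(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: prompt -> unit-norm embedding."""
    v = rng.standard_normal(EMB)
    return v / np.linalg.norm(v)

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: CLIP text embedding -> CLIP image embedding."""
    W = rng.standard_normal((EMB, EMB)) / np.sqrt(EMB)  # hypothetical learned map
    v = W @ text_emb
    return v / np.linalg.norm(v)

def diffusion_decoder(image_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: image embedding -> RGB pixels."""
    return rng.random((RES, RES, 3))  # placeholder pixels, not a real sample

# Stage 1: text -> CLIP image embedding; Stage 2: embedding -> image.
text_emb = clip_text_encode("a corgi playing a trumpet")
image_emb = prior(text_emb)
image = diffusion_decoder(image_emb)
print(image.shape)  # (1024, 1024, 3)
```

Decoupling the stages is what enables the creative workflows listed above: variations come from re-decoding the same image embedding, and edits come from manipulating the embedding or conditioning the decoder differently.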

Connections

Influenced by
10. Zero-Shot Text-to-Image Generation (DALL-E)
Jan 2021
11. Learning Transferable Visual Models (CLIP)
Feb 2021
Influences
25. Sora: Creating video from text
Feb 2024