OpenAI · Apr 2022

14. DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents

Photorealistic text-to-image generation

Research Paper
Summary

DALL-E 2 combined CLIP embeddings with diffusion models to generate photorealistic 1024×1024 images from text, dramatically improving on DALL-E's quality and becoming the catalyst for the generative-image explosion.

Key Concepts

A prior converts the text's CLIP embedding into a CLIP image embedding; a diffusion decoder then generates the image from that embedding
1024×1024 output resolution, a 4× per-side leap over DALL-E's 256×256
Working in CLIP's latent space keeps generations semantically coherent with the prompt
Image variations, inpainting, and text-guided editing unlock creative workflows
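The two-stage pipeline above can be sketched in a few lines. This is a hedged toy illustration, not the paper's implementation: the encoder, prior, and decoder are stand-in functions with random weights, and the names and embedding width (512) are hypothetical; only the stage ordering and the 1024×1024 output size come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 512    # CLIP embedding width (illustrative, not from the paper)
RES = 1024   # output resolution reported for DALL-E 2

def clip_text_encode(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: prompt -> unit-norm embedding."""
    v = rng.standard_normal(EMB)
    return v / np.linalg.norm(v)

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: CLIP text embedding -> CLIP image embedding."""
    W = rng.standard_normal((EMB, EMB)) / np.sqrt(EMB)  # hypothetical learned map
    v = W @ text_emb
    return v / np.linalg.norm(v)

def diffusion_decoder(image_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: image embedding -> RGB pixels."""
    return rng.random((RES, RES, 3))  # placeholder pixels, not a real sample

# Stage 1: text -> CLIP image embedding; Stage 2: embedding -> image.
text_emb = clip_text_encode("a corgi playing a trumpet")
image_emb = prior(text_emb)
image = diffusion_decoder(image_emb)
print(image.shape)  # (1024, 1024, 3)
```

Decoupling the stages is what enables the creative workflows listed above: variations come from re-decoding the same image embedding, and edits come from manipulating the embedding or conditioning the decoder differently.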

Connections

Influenced by
10. Zero-Shot Text-to-Image Generation (DALL-E)
Jan 2021
11. Learning Transferable Visual Models (CLIP)
Feb 2021
Influences
25. Sora: Creating video from text
Feb 2024