Photorealistic text-to-image generation
Research Paper
DALL-E 2 combined CLIP embeddings with diffusion models to generate photorealistic images from text at 1024×1024 resolution, a marked improvement in quality over the original DALL-E and a catalyst for the surge in generative image tools that followed.
By working in CLIP's latent space, DALL-E 2 could generate semantically coherent images at 1024×1024 resolution, up from DALL-E's 256×256: four times the resolution per side and sixteen times the pixel count.
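The data flow described above can be sketched roughly as follows. This is a toy illustration, not DALL-E 2's implementation: every function here (`encode_text`, `prior`, `decode`, `upsample`, `generate`) is a hypothetical stand-in that only mimics the shapes involved. The staging of a 64×64 base image upsampled to 256×256 and then 1024×1024 is an assumption drawn from the published description of the model, not from this summary.

```python
import random

EMBED_DIM = 8  # toy size chosen for illustration; real CLIP embeddings are far larger


def encode_text(prompt):
    """Hypothetical stand-in for a CLIP text encoder: deterministic pseudo-embedding."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def prior(text_embedding):
    """Hypothetical stand-in for the model that maps a text embedding to an image embedding."""
    return [0.5 * x for x in text_embedding]


def decode(image_embedding, size=64):
    """Hypothetical stand-in for the diffusion decoder: returns a size x size grayscale grid."""
    rng = random.Random(sum(image_embedding))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample(image, factor=4):
    """Nearest-neighbour upsampling, standing in for a learned diffusion upsampler."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def generate(prompt):
    """Text -> text embedding -> image embedding -> 64x64 -> 256x256 -> 1024x1024."""
    z_text = encode_text(prompt)
    z_image = prior(z_text)
    img_64 = decode(z_image, size=64)
    img_256 = upsample(img_64, factor=4)
    return upsample(img_256, factor=4)


if __name__ == "__main__":
    image = generate("a corgi playing a trumpet")
    print(len(image), len(image[0]))
```

The point of the sketch is that the text never conditions the pixels directly: the prompt is first turned into an embedding, and only that embedding drives image synthesis and the later resolution increases.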
It also supported image variations, inpainting, and text-guided editing.