Text-to-video enters the frontier
Product Announcement: OpenAI announced Sora, a text-to-video diffusion model capable of generating up to 60 seconds of high-fidelity video with complex scenes, multiple characters, and camera motion — a step-change in generative video quality.
A diffusion transformer operating on spacetime patches of video and images, processing them as sequences of patches analogous to tokens in language models.
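The spacetime-patch idea can be illustrated with a minimal sketch: carve a video tensor into fixed-size temporal-spatial blocks and flatten each into a vector, yielding the "token" sequence a diffusion transformer would consume. The function name, patch sizes, and tensor layout below are illustrative assumptions, not Sora's actual implementation (which OpenAI has not released).

```python
import numpy as np

def video_to_spacetime_patches(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a video tensor of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C): one row per
    patch, analogous to a token sequence in a language model. Hypothetical
    helper for illustration; not OpenAI's code.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "patch sizes must divide the video dims"
    # Carve the video into a grid of (pt, ph, pw) blocks...
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...group the grid axes together, then the within-block axes...
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten each block into a single patch vector.
    return x.reshape(-1, pt * ph * pw * C)

# Tiny example: 4 frames of 8x8 RGB video with 2x4x4 patches
# -> (4/2)*(8/4)*(8/4) = 8 patches, each 2*4*4*3 = 96 dims.
video = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
patches = video_to_spacetime_patches(video, pt=2, ph=4, pw=4)
print(patches.shape)  # (8, 96)
```

Because an image is just a one-frame video, the same scheme handles both modalities, which is one reason the patch representation is attractive for joint image-and-video training.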
Up to 60 seconds of video, complex camera motion, multiple characters with persistent identity, and physically plausible (though not perfect) interactions.
OpenAI framed Sora not just as a video generator but as a "world simulator" — a model that understands physics, causality, and 3D consistency by learning from video data.