Text-to-video enters the frontier
Product Announcement: OpenAI announced Sora, a text-to-video diffusion model capable of generating up to 60 seconds of high-fidelity video with complex scenes, multiple characters, and camera motion — a step-change in generative video quality.
A diffusion transformer operating on spacetime patches of video and images, processing them as sequences of patches analogous to tokens in language models.
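The spacetime-patch idea can be illustrated with a minimal sketch: carve a video tensor into fixed-size temporal-spatial blocks and flatten each into a vector, yielding the "token" sequence a diffusion transformer would consume. The function name, patch sizes, and tensor layout below are illustrative assumptions, not Sora's actual implementation (which OpenAI has not released).

```python
import numpy as np

def video_to_spacetime_patches(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a video tensor of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C): one row per
    patch, analogous to a token sequence in a language model. Hypothetical
    helper for illustration; not OpenAI's code.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "patch sizes must divide the video dims"
    # Carve the video into a grid of (pt, ph, pw) blocks...
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...group the grid axes together, then the within-block axes...
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten each block into a single patch vector.
    return x.reshape(-1, pt * ph * pw * C)

# Tiny example: 4 frames of 8x8 RGB video with 2x4x4 patches
# -> (4/2)*(8/4)*(8/4) = 8 patches, each 2*4*4*3 = 96 dims.
video = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
patches = video_to_spacetime_patches(video, pt=2, ph=4, pw=4)
print(patches.shape)  # (8, 96)
```

Because an image is just a one-frame video, the same scheme handles both modalities, which is one reason the patch representation is attractive for joint image-and-video training.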
Up to 60 seconds of video, complex camera motion, multiple characters with persistent identity, and physically plausible (though not perfect) interactions.
OpenAI framed Sora not just as a video generator but as a "world simulator" — a model that understands physics, causality, and 3D consistency by learning from video data.