The omnimodal model
Product Announcement: Launched GPT-4o ('omni'), a natively multimodal model that processes text, audio, and vision in a single end-to-end architecture, enabling real-time voice conversation with emotional expressiveness at near-human conversational latency.
A single neural network accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs. Generated audio can carry tone and emotion, and can even sing.
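Because one model handles all modalities, a single API request can mix content types. The sketch below assembles a text-plus-image request payload in the shape used by the OpenAI Chat Completions API; the helper function name and the image URL are illustrative placeholders, not part of the announcement.

```python
# Sketch of a multimodal request payload for GPT-4o, following the
# Chat Completions API's content-part format (text + image_url).
# build_multimodal_request and the example URL are hypothetical.

def build_multimodal_request(prompt: str, image_url: str) -> dict:
    """Assemble a chat request mixing text and image inputs for GPT-4o."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_multimodal_request(
    "What is shown in this image?",
    "https://example.com/photo.jpg",  # placeholder URL
)
```

In practice this dictionary would be passed to the API client (e.g. `client.chat.completions.create(**request)`); the point here is only that text and image parts travel in one message to one model.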
Responds to audio in as little as 232 ms, with an average of about 320 ms — comparable to human conversational response time.
GPT-4o was made available to free ChatGPT users, massively expanding access.
50% cheaper than GPT-4 Turbo on the API, with 2x faster throughput.