The omnimodal model
Product Announcement: Launched GPT-4o ('omni'), a natively multimodal model that processes text, audio, and vision in a single end-to-end architecture, enabling real-time voice conversation with emotional expressiveness at near-human conversational latency.
A single neural network accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs. Generated audio can carry tone and emotion, and can even sing.
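Because one model handles all modalities, a single API request can mix content types. The sketch below assembles a text-plus-image request payload in the shape used by the OpenAI Chat Completions API; the helper function name and the image URL are illustrative placeholders, not part of the announcement.

```python
# Sketch of a multimodal request payload for GPT-4o, following the
# Chat Completions API's content-part format (text + image_url).
# build_multimodal_request and the example URL are hypothetical.

def build_multimodal_request(prompt: str, image_url: str) -> dict:
    """Assemble a chat request mixing text and image inputs for GPT-4o."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_multimodal_request(
    "What is shown in this image?",
    "https://example.com/photo.jpg",  # placeholder URL
)
```

In practice this dictionary would be passed to the API client (e.g. `client.chat.completions.create(**request)`); the point here is only that text and image parts travel in one message to one model.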
Responds to audio in as little as 232 ms, with an average of about 320 ms — comparable to human conversational response time.
GPT-4o was made available to free ChatGPT users, massively expanding access.
50% cheaper than GPT-4 Turbo on the API, with 2x faster throughput.