Benchmark measuring AI performance across 44 professional occupations using real workplace tasks
Research Paper: OpenAI introduced GDPval, a benchmark evaluating AI model performance across 44 professional occupations using real workplace tasks. Unlike synthetic benchmarks, GDPval tasks were sourced from practicing professionals and validated by domain experts. GPT-5.4 matched or exceeded industry professionals in 83% of occupational comparisons. The benchmark provides granular per-occupation capability profiles, enabling targeted deployment decisions rather than a binary 'capable/not capable' assessment.
GDPval covers occupations from software engineering and legal analysis to medical diagnosis, financial modeling, and creative writing. Tasks were drawn from working professionals and vetted by domain experts to ensure they represent genuine workplace challenges, not simplified approximations. Each occupation includes 50-100 tasks spanning routine to expert-level difficulty.
GPT-5.4 matched or exceeded industry professionals in 83% of occupational task comparisons. However, the remaining 17% showed significant performance gaps — particularly in occupations requiring physical context awareness, sustained relationship management, or creative judgment under ambiguity. The gap pattern is as informative as the matching rate.
Rather than a single aggregate score, GDPval provides capability profiles per occupation showing exactly which task types AI handles well and which remain challenging. This enables organizations to make targeted deployment decisions — using AI for specific task types within an occupation rather than attempting wholesale automation.
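A per-occupation capability profile of this kind can be sketched as a simple data structure plus a thresholding rule. The occupation, task-type names, match rates, and the 0.8 threshold below are all illustrative assumptions, not GDPval data; the point is only how a profile supports task-level deployment decisions instead of a single capable/not-capable verdict.

```python
# Hypothetical sketch: using a per-occupation capability profile to split
# task types into "deploy AI" vs. "keep with humans". All names, scores,
# and the threshold are made up for illustration, not taken from GDPval.
from dataclasses import dataclass


@dataclass
class TaskTypeResult:
    task_type: str
    ai_match_rate: float  # fraction of comparisons where AI output was
                          # rated as good as or better than the professional's


def deployment_plan(profile, threshold=0.8):
    """Partition an occupation's task types by AI match rate."""
    deploy = [t.task_type for t in profile if t.ai_match_rate >= threshold]
    keep_human = [t.task_type for t in profile if t.ai_match_rate < threshold]
    return deploy, keep_human


# Illustrative profile for one occupation (invented numbers):
legal_analysis = [
    TaskTypeResult("contract summarization", 0.91),
    TaskTypeResult("precedent research", 0.84),
    TaskTypeResult("negotiation strategy", 0.55),  # ambiguity-heavy gap area
]

deploy, keep = deployment_plan(legal_analysis)
# deploy -> ["contract summarization", "precedent research"]
# keep   -> ["negotiation strategy"]
```

Under this framing, an organization would automate only the task types above its confidence threshold and leave the rest with professionals, which matches the paper's argument against wholesale automation.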