35 PDF-to-JSON extraction tasks revealing frontier models fail on complex document schemas
Research paper. Introduces ExtractBench, a benchmark for structured extraction across finance, academia, hiring, and sports domains: 35 PDF-to-JSON tasks with gold labels covering 12,867 fields. Key findings: frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Pro) are unreliable on complex schemas, producing 0% valid output on 369-field financial documents. Most surprisingly, structured output APIs actually worsened extraction performance compared to free-form JSON generation. This reveals a major capability gap in enterprise AI applications.
35 tasks spanning four domains: finance (investment documents, regulatory filings), academia (research papers, citations), hiring (resumes, cover letters), and sports (statistics, performance data). Each task requires extracting structured data against precise field definitions, with scoring performed against hand-verified gold labels. Tasks range from simple schemas (10-50 fields) to complex ones (300+ fields).
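As an illustration of what such a task looks like, the sketch below pairs a field schema with gold labels and a simple field-level scorer. The field names, file paths, and scoring rule are hypothetical, invented for this example rather than taken from ExtractBench itself.

```python
# Hypothetical sketch of a simple extraction task in the benchmark's style.
# Field names and values are invented for illustration.
task = {
    "domain": "hiring",
    "source_pdf": "resume_001.pdf",
    "schema": {  # precise field definitions the model must satisfy
        "name": "string",
        "years_experience": "integer",
        "skills": "list[string]",
    },
    "gold_labels": {  # hand-verified values used for scoring
        "name": "Jane Doe",
        "years_experience": 7,
        "skills": ["Python", "SQL"],
    },
}

def score(prediction: dict, gold: dict) -> float:
    """Field-level accuracy: fraction of gold fields matched exactly."""
    return sum(prediction.get(k) == v for k, v in gold.items()) / len(gold)
```

A 300+-field task has the same shape, just with a much deeper, nested schema, which is where the benchmark shows models breaking down.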
Frontier models now support specialized structured output modes (JSON Schema constraints, OpenAI's Structured Outputs, Anthropic's JSON output schemas). These APIs are designed to improve extraction reliability. Counter-intuitively, ExtractBench shows that free-form JSON generation often outperforms the constrained modes, suggesting the constraint machinery introduces failure modes of its own.
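To make the free-form side of that comparison concrete, here is a minimal sketch of the lenient parsing that free-form JSON generation typically relies on. This is an assumed, illustrative helper, not ExtractBench's harness; the constrained alternative (e.g. an API-enforced schema during decoding) happens server-side and is not shown.

```python
import json

def parse_freeform_json(model_text: str):
    """Leniently extract a JSON object from free-form model output.

    Models often wrap JSON in markdown fences or surrounding prose;
    this strips common wrappers before parsing. A constrained
    structured-output mode would instead enforce the schema token by
    token during decoding.
    """
    text = model_text.strip()
    # Drop a surrounding markdown code fence, if present.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # remove the ```json line
        text = text.rsplit("```", 1)[0]  # remove the closing fence
    # Fall back to the outermost {...} span if prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return json.loads(text)
```

The benchmark's finding suggests that this lenient free-form path, despite its apparent fragility, recovers valid output more often than constrained decoding on complex schemas.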
Frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Pro) achieve 0% valid output on 369-field financial extraction tasks, while simpler tasks show 40-70% success rates. The problem isn't missing fields; it's malformed JSON or schema violations that break parsing entirely, rendering the extraction useless in production pipelines.
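This failure mode is easy to reproduce: standard JSON parsers are all-or-nothing, so a single malformed token invalidates the entire extraction, no matter how many fields were correct. A minimal illustration:

```python
import json

# Strict JSON parsing is all-or-nothing: one bad token and the whole
# document is rejected, not just the offending field.
valid = '{"revenue": 1200, "currency": "USD"}'
malformed = '{"revenue": 1200, "currency": "USD",}'  # trailing comma

print(json.loads(valid))  # parses fine

try:
    json.loads(malformed)
except json.JSONDecodeError as exc:
    # A 369-field extraction with one such error is unusable downstream.
    print(f"parse failed: {exc.msg}")
```

This is why the paper measures valid output rather than per-field accuracy alone: in a production pipeline, output that fails to parse contributes nothing.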