35 PDF-to-JSON extraction tasks revealing frontier models fail on complex document schemas
Research paper. Introduces ExtractBench, a benchmark for structured extraction across finance, academia, hiring, and sports domains: 35 PDF-to-JSON tasks with gold labels covering 12,867 fields. Key findings: frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Pro) are unreliable on complex schemas, producing 0% valid output on 369-field financial documents. Most surprisingly, structured output APIs actually worsened extraction performance compared to free-form JSON generation. This reveals a major capability gap in enterprise AI applications.
35 tasks spanning four domains: finance (investment documents, regulatory filings), academia (research papers, citations), hiring (resumes, cover letters), and sports (statistics, performance data). Each task requires extracting structured data against precise field definitions, with scoring performed against hand-verified gold labels. Tasks range from simple schemas (10-50 fields) to complex ones (300+ fields).
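As an illustration of what such a task looks like, the sketch below pairs a field schema with gold labels and a simple field-level scorer. The field names, file paths, and scoring rule are hypothetical, invented for this example rather than taken from ExtractBench itself.

```python
# Hypothetical sketch of a simple extraction task in the benchmark's style.
# Field names and values are invented for illustration.
task = {
    "domain": "hiring",
    "source_pdf": "resume_001.pdf",
    "schema": {  # precise field definitions the model must satisfy
        "name": "string",
        "years_experience": "integer",
        "skills": "list[string]",
    },
    "gold_labels": {  # hand-verified values used for scoring
        "name": "Jane Doe",
        "years_experience": 7,
        "skills": ["Python", "SQL"],
    },
}

def score(prediction: dict, gold: dict) -> float:
    """Field-level accuracy: fraction of gold fields matched exactly."""
    return sum(prediction.get(k) == v for k, v in gold.items()) / len(gold)
```

A 300+-field task has the same shape, just with a much deeper, nested schema, which is where the benchmark shows models breaking down.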
Frontier models now support specialized structured output modes (JSON Schema constraints, OpenAI's Structured Outputs, Anthropic's JSON output schemas). These APIs are designed to improve extraction reliability. Counter-intuitively, ExtractBench shows that free-form JSON generation often outperforms the constrained modes, suggesting the constraint machinery introduces failure modes of its own.
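To make the free-form side of that comparison concrete, here is a minimal sketch of the lenient parsing that free-form JSON generation typically relies on. This is an assumed, illustrative helper, not ExtractBench's harness; the constrained alternative (e.g. an API-enforced schema during decoding) happens server-side and is not shown.

```python
import json

def parse_freeform_json(model_text: str):
    """Leniently extract a JSON object from free-form model output.

    Models often wrap JSON in markdown fences or surrounding prose;
    this strips common wrappers before parsing. A constrained
    structured-output mode would instead enforce the schema token by
    token during decoding.
    """
    text = model_text.strip()
    # Drop a surrounding markdown code fence, if present.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # remove the ```json line
        text = text.rsplit("```", 1)[0]  # remove the closing fence
    # Fall back to the outermost {...} span if prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return json.loads(text)
```

The benchmark's finding suggests that this lenient free-form path, despite its apparent fragility, recovers valid output more often than constrained decoding on complex schemas.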
Frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Pro) achieve 0% valid output on 369-field financial extraction tasks, while simpler tasks show 40-70% success rates. The problem isn't missing fields; it's malformed JSON or schema violations that break parsing entirely, rendering the extraction useless in production pipelines.
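This failure mode is easy to reproduce: standard JSON parsers are all-or-nothing, so a single malformed token invalidates the entire extraction, no matter how many fields were correct. A minimal illustration:

```python
import json

# Strict JSON parsing is all-or-nothing: one bad token and the whole
# document is rejected, not just the offending field.
valid = '{"revenue": 1200, "currency": "USD"}'
malformed = '{"revenue": 1200, "currency": "USD",}'  # trailing comma

print(json.loads(valid))  # parses fine

try:
    json.loads(malformed)
except json.JSONDecodeError as exc:
    # A 369-field extraction with one such error is unusable downstream.
    print(f"parse failed: {exc.msg}")
```

This is why the paper measures valid output rather than per-field accuracy alone: in a production pipeline, output that fails to parse contributes nothing.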