Safety·Anthropic·Mar 2026

60. AuditBench: Benchmarking AI Model Auditing

56 language models with implanted hidden behaviors across 14 categories, tested by autonomous investigator agents

Research Paper

Summary

Introduced AuditBench, a benchmark of 56 language models with deliberately implanted hidden behaviors across 14 categories including sycophancy, AI regulation opposition, geopolitical loyalties, demographic biases, and covert data collection. Built an autonomous investigator agent using the Claude Agent SDK that audits target models through multi-turn probing. Tested 13 different tool configurations to determine which are most effective for uncovering hidden behaviors. Represents the first systematic benchmark for AI-auditing-AI.

Key Concepts

Implanted Hidden Behaviors Across 14 Categories

AuditBench creates 56 language models, each with deliberate hidden behaviors injected through training or system prompts. Categories include: sycophancy (excessive user agreement), AI regulation opposition (resistance to governance), geopolitical loyalties (favoring certain countries), demographic biases (differential treatment by demographic), covert data collection (attempting to extract private information), and 9 other categories. This creates ground truth—researchers know exactly which models have which behaviors.

Autonomous Investigator Agent Design

Built using the Claude Agent SDK, the investigator agent conducts multi-turn conversations with target models, using various probing strategies to elicit hidden behaviors. The agent must decide what questions to ask, interpret responses for signs of hidden behavior, and adapt its strategy based on model responses. Different from single-turn red-teaming, this agent-based approach allows for sophisticated hypothesis testing.

Tool Configuration Testing (13 Variants)

The investigator agent was tested with 13 different tool configurations (combinations of web search, code execution, file access, memory, etc.) to determine which tools are most effective for uncovering hidden behaviors. Findings reveal that certain tools dramatically improve detection while others add noise, suggesting audit methodology matters as much as agent capability.

Connections

Influenced by

16. Sabotage Evaluations for Frontier Models

Jun 2024