Agentic framework that automatically detects and mitigates safety failures with minimal human intervention
Research paper. Introduces A3 (Automated Alignment Agent), an agentic framework that automatically detects safety failures in language models, generates targeted training data, fine-tunes the model to fix each failure, and logs the experiments, all with minimal human intervention. Multi-stage pipeline: safety failure detection → data generation agent → fine-tuning agent → experiment log → verification. The framework is being open-sourced. It represents a step toward automated safety: using AI to align AI.
A3 operates in five stages: (1) Safety failure detection—automatically identifying behaviors that violate alignment objectives, (2) Data generation—an agent creates targeted training examples to address detected failures, (3) Fine-tuning—automatically retraining the model on new safety data, (4) Experiment logging—systematic tracking of all modifications and their effects, (5) Verification—testing whether the fix actually eliminated the failure without introducing new problems. The entire pipeline runs with minimal human involvement.
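The five stages above can be sketched as a single loop. This is a minimal illustrative sketch, not A3's actual code: every name here (`run_a3_iteration`, `detect`, `generate`, and so on) is a hypothetical stand-in for the corresponding pipeline stage.

```python
from dataclasses import dataclass

# Hypothetical sketch of A3's five-stage loop; all names are
# illustrative stand-ins, not the real A3 API.

@dataclass
class ExperimentRecord:
    failure: str          # detected failure mode
    n_examples: int       # size of the generated training set
    fix_verified: bool    # did verification pass?

def run_a3_iteration(model, detect, generate, finetune, verify, logbook):
    """One pass of: detect -> generate -> fine-tune -> log -> verify."""
    for failure in detect(model):            # (1) safety failure detection
        examples = generate(failure)         # (2) targeted data generation
        model = finetune(model, examples)    # (3) automatic fine-tuning
        ok = verify(model, failure)          # (5) verify fix, no regressions
        logbook.append(ExperimentRecord(failure, len(examples), ok))  # (4)
    return model
```

In this framing each stage is a pluggable callable, which matches the paper's description of separate detection, data-generation, and fine-tuning agents feeding a shared experiment log.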
Rather than generic safety data, A3's data generation agent creates examples specifically targeting the detected failure mode. If a model exhibits sycophancy, the agent generates training examples showing how to disagree respectfully. If a model shows oversight subversion, the agent creates examples showing transparency and cooperation. This targeted approach is more efficient than broad safety training.
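One way to realize this targeting is a mapping from failure mode to a corrective prompt template, as in the sketch below. The template texts and failure-mode names are assumptions for illustration; A3's agent presumably generates richer examples than string templates.

```python
# Hypothetical sketch of failure-targeted data generation: each detected
# failure mode maps to a template that elicits corrective training
# examples. Templates and mode names are illustrative, not A3's actual ones.

TARGETED_TEMPLATES = {
    "sycophancy": (
        "User asserts an incorrect claim: {claim}\n"
        "Assistant: respond by respectfully disagreeing and correcting it."
    ),
    "oversight_subversion": (
        "Scenario: {claim}\n"
        "Assistant: respond transparently, disclosing all actions to overseers."
    ),
}

def generate_targeted_examples(failure_mode, seed_claims):
    """Build fine-tuning prompts aimed at one specific failure mode."""
    template = TARGETED_TEMPLATES[failure_mode]
    return [template.format(claim=claim) for claim in seed_claims]
```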
After each fine-tuning iteration, A3 automatically verifies that the fix worked and that it did not introduce regressions. Every experiment is logged with reproducible seeds, training data, model checkpoints, and evaluation results, enabling systematic study of what works and what doesn't in AI alignment.
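A minimal sketch of what such reproducible logging and verification might look like, assuming a JSONL log and a simple acceptance rule (field names, the `tol` threshold, and the acceptance criterion are all hypothetical, not taken from the paper):

```python
import hashlib
import json

# Hypothetical sketch of A3-style experiment logging: each run records the
# seed, a hash of the training data, the checkpoint id, and eval results
# so the experiment can be replayed. Field names are illustrative.

def log_experiment(path, seed, training_data, checkpoint, eval_results):
    entry = {
        "seed": seed,
        "data_sha256": hashlib.sha256(
            json.dumps(training_data, sort_keys=True).encode()
        ).hexdigest(),
        "checkpoint": checkpoint,
        "eval": eval_results,  # target metric plus regression-suite scores
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def fix_accepted(eval_results, baseline, tol=0.01):
    """Accept a fix only if the target failure is eliminated and no other
    tracked metric regressed by more than `tol` relative to baseline."""
    if eval_results["target_failure_rate"] > 0.0:
        return False
    return all(
        eval_results[k] >= baseline[k] - tol
        for k in baseline if k != "target_failure_rate"
    )
```

Hashing the training data and pinning the seed and checkpoint id is one straightforward way to make a fine-tuning run replayable; the regression check captures the "without introducing new problems" requirement from the verification stage.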