Agentic framework that automatically detects and mitigates safety failures with minimal human intervention
Research paper. Introduces A3 (Automated Alignment Agent), an agentic framework that automatically detects safety failures in language models, generates targeted training data, fine-tunes the model to fix each failure, and logs the experiments, all with minimal human intervention. Multi-stage pipeline: safety failure detection → data generation agent → fine-tuning agent → experiment log → verification. The framework is being open-sourced. It represents a step toward automated safety: using AI to align AI.
A3 operates in five stages: (1) Safety failure detection—automatically identifying behaviors that violate alignment objectives, (2) Data generation—an agent creates targeted training examples to address detected failures, (3) Fine-tuning—automatically retraining the model on new safety data, (4) Experiment logging—systematic tracking of all modifications and their effects, (5) Verification—testing whether the fix actually eliminated the failure without introducing new problems. The entire pipeline runs with minimal human involvement.
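The five stages above can be sketched as a single loop. This is a minimal illustrative sketch, not A3's actual code: every name here (`run_a3_iteration`, `detect`, `generate`, and so on) is a hypothetical stand-in for the corresponding pipeline stage.

```python
from dataclasses import dataclass

# Hypothetical sketch of A3's five-stage loop; all names are
# illustrative stand-ins, not the real A3 API.

@dataclass
class ExperimentRecord:
    failure: str          # detected failure mode
    n_examples: int       # size of the generated training set
    fix_verified: bool    # did verification pass?

def run_a3_iteration(model, detect, generate, finetune, verify, logbook):
    """One pass of: detect -> generate -> fine-tune -> log -> verify."""
    for failure in detect(model):            # (1) safety failure detection
        examples = generate(failure)         # (2) targeted data generation
        model = finetune(model, examples)    # (3) automatic fine-tuning
        ok = verify(model, failure)          # (5) verify fix, no regressions
        logbook.append(ExperimentRecord(failure, len(examples), ok))  # (4)
    return model
```

In this framing each stage is a pluggable callable, which matches the paper's description of separate detection, data-generation, and fine-tuning agents feeding a shared experiment log.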
Rather than generic safety data, A3's data generation agent creates examples specifically targeting the detected failure mode. If a model exhibits sycophancy, the agent generates training examples showing how to disagree respectfully. If a model shows oversight subversion, the agent creates examples showing transparency and cooperation. This targeted approach is more efficient than broad safety training.
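One way to realize this targeting is a mapping from failure mode to a corrective prompt template, as in the sketch below. The template texts and failure-mode names are assumptions for illustration; A3's agent presumably generates richer examples than string templates.

```python
# Hypothetical sketch of failure-targeted data generation: each detected
# failure mode maps to a template that elicits corrective training
# examples. Templates and mode names are illustrative, not A3's actual ones.

TARGETED_TEMPLATES = {
    "sycophancy": (
        "User asserts an incorrect claim: {claim}\n"
        "Assistant: respond by respectfully disagreeing and correcting it."
    ),
    "oversight_subversion": (
        "Scenario: {claim}\n"
        "Assistant: respond transparently, disclosing all actions to overseers."
    ),
}

def generate_targeted_examples(failure_mode, seed_claims):
    """Build fine-tuning prompts aimed at one specific failure mode."""
    template = TARGETED_TEMPLATES[failure_mode]
    return [template.format(claim=claim) for claim in seed_claims]
```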
After each fine-tuning iteration, A3 automatically verifies that the fix worked and that it did not introduce regressions. Every experiment is logged with reproducible seeds, training data, model checkpoints, and evaluation results, enabling systematic study of what works and what doesn't in AI alignment.
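A minimal sketch of what such reproducible logging and verification might look like, assuming a JSONL log and a simple acceptance rule (field names, the `tol` threshold, and the acceptance criterion are all hypothetical, not taken from the paper):

```python
import hashlib
import json

# Hypothetical sketch of A3-style experiment logging: each run records the
# seed, a hash of the training data, the checkpoint id, and eval results
# so the experiment can be replayed. Field names are illustrative.

def log_experiment(path, seed, training_data, checkpoint, eval_results):
    entry = {
        "seed": seed,
        "data_sha256": hashlib.sha256(
            json.dumps(training_data, sort_keys=True).encode()
        ).hexdigest(),
        "checkpoint": checkpoint,
        "eval": eval_results,  # target metric plus regression-suite scores
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def fix_accepted(eval_results, baseline, tol=0.01):
    """Accept a fix only if the target failure is eliminated and no other
    tracked metric regressed by more than `tol` relative to baseline."""
    if eval_results["target_failure_rate"] > 0.0:
        return False
    return all(
        eval_results[k] >= baseline[k] - tol
        for k in baseline if k != "target_failure_rate"
    )
```

Hashing the training data and pinning the seed and checkpoint id is one straightforward way to make a fine-tuning run replayable; the regression check captures the "without introducing new problems" requirement from the verification stage.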