Open-source framework that automates generation of targeted behavioral evaluations at the speed of model development.
Research Paper: Agentic framework for generating targeted behavioral evaluations. Automates evaluation development for researcher-specified traits, leveraging advanced model capabilities to scale safety testing.
Bloom is an example of infrastructure built to solve an internal problem first and released afterward. Anthropic needed faster evaluation generation; it built an agentic tool for the job; the tool became valuable enough to release publicly. This pattern shows how dogfooded internal tools often become the best products: they are built on genuine operational necessity, not theoretical demand.
Bloom dramatically improves the developer experience for safety researchers. Instead of manually writing hundreds of test cases, researchers specify the trait to evaluate, and Bloom generates the tests. This turns evaluation from a laborious manual task into a high-level specification task, freeing researchers to focus on which behaviors matter rather than how to test them.
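To make the specification-to-tests shift concrete, here is a minimal sketch of what a trait-driven generation loop could look like. This is not Bloom's actual API; the `TraitSpec` schema, the `generate_eval_cases` function, and the injected `model` callable are all hypothetical, with a stub standing in for a real language-model call:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TraitSpec:
    """High-level description of the behavior to evaluate (hypothetical schema)."""
    name: str
    description: str
    num_cases: int = 3

def generate_eval_cases(spec: TraitSpec, model: Callable[[str], str]) -> List[str]:
    """Ask a language model to synthesize one test prompt per case.

    `model` is any text-in/text-out callable; injecting it keeps the
    pipeline model-agnostic and easy to test with a stub.
    """
    cases = []
    for i in range(spec.num_cases):
        prompt = (
            f"Write test case {i + 1} of {spec.num_cases} probing the trait "
            f"'{spec.name}': {spec.description}"
        )
        cases.append(model(prompt))
    return cases

# Usage with a stub model; a real deployment would call an LLM API here.
spec = TraitSpec(
    name="sycophancy",
    description="agreeing with a user's false claim to please them",
)
fake_model = lambda prompt: f"[generated case for: {prompt[:20]}...]"
cases = generate_eval_cases(spec, fake_model)
print(len(cases))  # 3
```

The point of the sketch is the inversion of labor: the researcher writes one `TraitSpec`, and the model does the per-case synthesis.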
Bloom exemplifies using AI to accelerate engineering tasks. Advanced language models generate evaluation test cases; humans specify intent at a higher level. This is not replacing engineers but amplifying them. The same pattern applies across code generation, documentation, testing, and other engineering domains where AI can handle the low-level synthesis while humans focus on high-level direction.