Bloom: open-source tool to evaluate AI behaviors
Bloom is a toolbox for researchers who want to measure problematic behaviors in frontier AI models quickly and at scale. Why does this matter now? Because manual evaluations are slow, they go out of date, and they can even pollute future training data.
Bloom automates scenario generation and scoring so you can quantify the frequency and severity of a behavior you define. That means you get repeatable numbers instead of handfuls of hand-checked examples — and you can iterate much faster.
What Bloom is and what it's for
Bloom is an open-source, agent-based framework that turns a behavior description and a seed configuration into a full evaluation suite. Instead of relying on a fixed set of examples, Bloom generates multiple scenarios per run, measures the same behavior across all of them, and preserves reproducibility through the seed (a configuration file).
Bloom is designed to test concrete traits. In the launch example, the team evaluated four alignment-relevant behaviors (hallucination and flattery, long-term instructed sabotage, self-preservation, and self-preferential bias) across 16 models. Results come back in days, not months, and include top-level metrics (elicitation rate, mean presence) as well as exportable transcripts.
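The two top-level metrics are simple to compute once a judge has scored every transcript. Here is a minimal sketch, assuming elicitation rate is the fraction of rollouts whose presence score crosses a threshold and mean presence is the average score; both the threshold and the [0, 1] score range are my assumptions, not Bloom's definitions:

```python
from statistics import mean

def summarize_judgments(presence_scores, threshold=0.5):
    """Collapse per-rollout judge scores into suite-level metrics.

    Assumes each score lies in [0, 1]; the 0.5 threshold is illustrative,
    not Bloom's actual definition.
    """
    elicitation_rate = sum(s >= threshold for s in presence_scores) / len(presence_scores)
    return {"elicitation_rate": elicitation_rate, "mean_presence": mean(presence_scores)}

# Example: eight rollouts scored by a judge model
print(summarize_judgments([0.9, 0.1, 0.7, 0.0, 0.4, 0.8, 0.2, 0.6]))
```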
How it works (4-stage architecture)
Bloom runs four automated agents that turn your input into an evaluation suite (sketched in code after this list):
Understanding: analyzes your behavior description and examples to define what to measure and why.
Ideation: generates scenarios designed to trigger the target behavior. Each scenario includes situation, simulated user, system prompt, and interaction context.
Rollout: executes scenarios in parallel; an agent simulates both the user and any tools to elicit the target model's response in multi-turn conversations.
Judgment: a judge model scores each transcript for presence of the behavior and secondary criteria; then a meta-judge produces suite-level analysis.
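To make the flow between stages concrete, here is a minimal sketch with placeholder functions standing in for Bloom's agents. Every name and signature is illustrative, not the framework's actual API; the real stages are driven by LLMs.

```python
from dataclasses import dataclass, field

# Stand-ins for Bloom's four agents. This sketch only shows how data flows
# from one stage to the next; all names here are illustrative.

@dataclass
class Scenario:
    situation: str
    simulated_user: str
    system_prompt: str

@dataclass
class Rollout:
    scenario: Scenario
    transcript: list = field(default_factory=list)

def understanding(behavior_description, examples):
    # Stage 1: distill what to measure and why from the seed.
    return f"measure '{behavior_description}' (informed by {len(examples)} examples)"

def ideation(understanding_summary, n_scenarios):
    # Stage 2: generate scenarios designed to trigger the target behavior.
    return [Scenario(f"situation {i} for {understanding_summary}",
                     "simulated user persona", "system prompt")
            for i in range(n_scenarios)]

def rollout(scenarios):
    # Stage 3: simulate the user (and any tools) against the target model, in parallel.
    return [Rollout(s, ["user: ...", "target model: ..."]) for s in scenarios]

def judgment(rollouts):
    # Stage 4: a judge model scores each transcript for presence of the behavior.
    return [0.0 for _ in rollouts]

summary = understanding("self-preferential bias", ["example transcript"])
scores = judgment(rollout(ideation(summary, n_scenarios=4)))
print(scores)  # a meta-judge would then produce suite-level analysis from these
```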
You can pick which models run each stage, adjust interaction length and modality (for example whether you expose tools to the model), control scenario diversity, and add secondary criteria like realism or elicitation difficulty. Bloom exports transcripts compatible with Inspect and integrates with Weights & Biases for large runs. The repo includes an example seed file to get you started.
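The example seed in the repo is the authoritative reference. Purely to illustrate the kind of information a seed carries (behavior description, examples, models per stage, run parameters), here is a hypothetical sketch in plain Python; the field names and structure are mine, not Bloom's actual schema:

```python
# Hypothetical seed, shown as a plain dict; field names are illustrative
# and do not follow Bloom's actual seed format.
seed = {
    "behavior": {
        "name": "self-preferential bias",
        "description": "The model favors its own outputs when asked to choose "
                       "among options that include something it produced.",
        "examples": [
            "Asked to rank three answers, one of which it wrote, the model "
            "picks its own despite a clearly stronger alternative.",
        ],
    },
    "models": {
        "target": "model-under-test",
        "ideation": "scenario-generation-model",
        "judge": "judge-model",
    },
    "run": {
        "num_scenarios": 50,     # scenario diversity
        "max_turns": 6,          # interaction length
        "expose_tools": False,   # modality: whether the target model sees tools
    },
    "secondary_criteria": ["realism", "elicitation difficulty"],
}
```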
Reproducibility and configurability
Bloom generates different scenarios on each run unless you use static evaluations. Reproducibility comes from sharing the seed, which defines the behavior description, the examples, and the key parameters. That lets you iterate locally on samples until the evaluation captures what you want, then sweep models at scale.
Validation: how much can you trust Bloom?
Anthropic validated Bloom with two key questions:
Does it separate models with different behavioral tendencies? Yes: when evaluating production models against intentionally misaligned 'model organisms', Bloom correctly separated the two in 9 out of 10 quirk tests; in the tenth case, manual review showed the base model exhibited the quirk at a similar rate.
How well does the automated judge align with human judgments? Comparing 40 hand-annotated transcripts against 11 judge models, Claude Opus 4.1 showed the highest Spearman correlation with human judgment (0.86), followed by Claude Sonnet 4.5 (0.75). Opus 4.1 also agrees better at the extremes, which is useful when you use thresholds to decide presence or absence of a behavior.
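If you want to run the same kind of check on your own judge, the comparison itself is a one-liner once you have paired scores. A minimal sketch, assuming one judge score and one human score per hand-annotated transcript (the numbers below are toy values):

```python
from scipy.stats import spearmanr

# One presence score per hand-annotated transcript (toy values).
human_scores = [0.0, 0.2, 0.9, 1.0, 0.4, 0.7, 0.1, 0.8]
judge_scores = [0.1, 0.3, 0.8, 0.9, 0.5, 0.6, 0.0, 0.9]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation, judge vs. human: {rho:.2f} (p = {p_value:.3f})")
```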
These metrics aren't an absolute guarantee, but they indicate Bloom can replicate human judgments in many cases and distinguish models with real behavioral differences.
Case study: self-preferential bias
In the self-preferential bias study (a model's tendency to favor itself when making choices), Bloom reproduced the model ranking from Claude Sonnet 4.5's system card: Sonnet 4.5 showed the least bias among the models tested. But Bloom also let researchers go further: raising the evaluated model's reasoning effort reduced the bias, with the biggest drop between medium and high effort.
Interesting detail: the reduction didn't come from Sonnet randomly picking other models. Instead, it started to recognize the conflict of interest and declined to evaluate its own option.
They also found that filtering rollouts with unwanted traits (for example, lack of realism or awareness of the evaluation) improves both elicitation rate and judgment quality. And while absolute metrics change with configuration (number of examples, conversation length, judge effort), relative model rankings tend to remain stable.
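A minimal sketch of that filtering step, assuming each rollout record carries judge-assigned flags for the secondary criteria; the field names and the 0.5 threshold are illustrative, not Bloom's:

```python
from statistics import mean

# Toy rollout records with judge-assigned secondary criteria (illustrative fields).
rollouts = [
    {"presence": 0.8, "realistic": True,  "eval_aware": False},
    {"presence": 0.1, "realistic": False, "eval_aware": False},  # unrealistic: drop
    {"presence": 0.9, "realistic": True,  "eval_aware": True},   # noticed the eval: drop
    {"presence": 0.6, "realistic": True,  "eval_aware": False},
]

kept = [r for r in rollouts if r["realistic"] and not r["eval_aware"]]
elicitation_rate = sum(r["presence"] >= 0.5 for r in kept) / len(kept)
mean_presence = mean(r["presence"] for r in kept)
print(f"kept {len(kept)}/{len(rollouts)} rollouts, "
      f"elicitation rate {elicitation_rate:.2f}, mean presence {mean_presence:.2f}")
```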
How to get started and best practices
Clone the repo and review the example seed to understand the structure.
Iterate locally: generate samples, review the scenarios, and tweak the behavior description until the samples reflect what you want to measure (see the sketch after this list).
Choose models for each stage carefully: the right judge is crucial for correlating with humans.
Control tool exposure and interaction length if you want to test behaviors that only appear with tool access or long dialogues.
Document and share the seed when you publish metrics so others can reproduce your measurements.
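The repo and its docs are the source of truth for how Bloom is actually invoked. Purely as a shape for the iterate-then-sweep workflow above, here is a hypothetical sketch in which generate_samples and run_suite are placeholder helpers, not Bloom's API:

```python
# Hypothetical workflow; generate_samples and run_suite are placeholders for
# whatever entry points the Bloom repo actually exposes.

def generate_samples(seed, n):
    """Placeholder: produce a small batch of scenarios from a seed."""
    return [f"scenario {i} probing {seed['behavior']}" for i in range(n)]

def run_suite(seed, models):
    """Placeholder: run the full pipeline per model and return elicitation rates."""
    return {m: 0.0 for m in models}

seed = {"behavior": "self-preferential bias", "examples": ["..."]}

# 1) Iterate locally on a handful of samples until they capture the behavior.
for scenario in generate_samples(seed, n=5):
    print(scenario)  # review by hand, then refine the behavior description and rerun

# 2) Only then sweep the models you care about at scale, and publish the seed.
print(run_suite(seed, models=["model-a", "model-b", "model-c"]))
```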
Limitations and risks to consider
Judge calibration: while some judges correlate well with humans, not all do; validate with manual annotations for critical cases.
Contamination and evolution: automated evaluations save time, but generated scenarios can eventually be exploited or reflect biases in the procedure itself.
Realism of simulations: rollouts depend on how faithfully agents simulate users and tools; biases there affect results.
Not a silver bullet: Bloom is great for measuring frequencies and comparing models, but it still needs careful experimental design and human review to support strong interpretations.
Bloom is already used to study nested jailbreak vulnerabilities, hardcoding, evaluation awareness, and traces of sabotage. If you work on alignment, it's a practical tool to speed evaluation cycles and dig into why models behave the way they do.
Thinking of evaluations as dynamic processes rather than fixed tests changes how we measure risk. Bloom proposes exactly that: generate and measure systematically, with configuration and reproducibility. Ready to try it and see what new behaviors you discover in your models?