For years the community has asked: can large language models do real science, not just answer textbook questions? Anthropic published a technical study that takes on that question with BioMysteryBench, a benchmark designed for complex bioinformatics tasks on real data.
What BioMysteryBench is and why it matters
BioMysteryBench is a set of 99 bioinformatics problems created by experts from real or minimally processed data (WGS, scRNA-seq, metagenomics, ChIP-seq, Hi-C, methylation, proteomics, and metabolomics). Each question comes with a validation notebook that shows the signal exists in the data, even if finding it from scratch can be hard.
The key idea is to measure research tasks that reflect real work: reading the data, installing tools (pip, conda), querying databases such as NCBI and Ensembl, writing and running analyses, and justifying the conclusions. It’s not just about knowing the answer; it’s about reproducing the scientific process.
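To make one of those steps concrete, here is a minimal sketch of how an agent might query NCBI programmatically. It only builds a request URL for NCBI's public E-utilities ESearch endpoint (a real, documented API); the gene and organism in the example are hypothetical illustrations, not taken from the benchmark, and actually fetching and parsing the result is left out.

```python
from urllib.parse import urlencode

# NCBI E-utilities base endpoint (public, documented API).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch query URL for an NCBI database.

    An agent would fetch this URL (e.g. with urllib or requests)
    and parse the returned JSON for matching record IDs.
    """
    params = urlencode({"db": db, "term": term,
                        "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{params}"

# Hypothetical example: nucleotide records for a human gene.
url = esearch_url("nucleotide", "BRCA1[Gene] AND Homo sapiens[Organism]")
print(url)
```

The point is not this particular query but the loop it belongs to: the model must decide what to ask the database, interpret what comes back, and fold it into the analysis it is running locally.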
