Anthropic empowers agents in biology with gget virus | Keryc
Anthropic proposes a practical shift: for AI agents to be useful in biology, biological data infrastructure must become "agent-friendly." Using a real case — retrieving viral sequences from NCBI Virus — they show models can understand the tasks but fail because the ecosystem is fragile. By adding a deterministic layer called gget virus, accuracy and reproducibility jump to almost 100%.
The problem: medieval streets for cars of the future
Can you imagine driving a race car through the alleys of a village built before cars existed? That’s the analogy Anthropic uses: biological databases were designed for humans who click, not for agents that run large-scale workflows.
The key points are clear:
Idiosyncratic formats, filters exposed only in web interfaces, and inconsistent metadata make programmatic retrieval fragile.
Small errors in data extraction can ruin later analyses: coordinates in the wrong build, mixing RefSeq and GenBank, confusing segments in segmented viruses, or losing records because metadata isn’t standardized.
For scientific tasks the bar is effectively 100%: a bad extraction can bias estimates of outbreak origin, diagnostic coverage, or therapy assessment.
The experiment: VirBench and the street test
Anthropic and team built VirBench, a benchmark with 120 realistic queries across 40 pathogens. The questions mimic surveillance tasks, diagnostic assay design, and building training datasets for protein models. One example: retrieving sequences of Orthoebolavirus zaireensis with simultaneous filters for host, geographic region, date windows, minimum length, and a maximum count of ambiguous N bases.
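To make the shape of such a query concrete, here is a minimal sketch of what "simultaneous filters" means once records are in hand. The `VirusRecord` class, the `passes_filters` helper, the accessions and the sequences are all hypothetical and made up for illustration; they are not VirBench data or part of any real tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VirusRecord:
    # Hypothetical, simplified record; real metadata carries many more fields.
    accession: str
    host: str
    country: str
    collection_date: date
    sequence: str


def passes_filters(rec, *, host, countries, start, end, min_length, max_n):
    """Return True only if the record satisfies every filter at once."""
    seq = rec.sequence.upper()
    return (
        rec.host == host
        and rec.country in countries
        and start <= rec.collection_date <= end
        and len(seq) >= min_length
        and seq.count("N") <= max_n
    )


# Toy records with invented accessions and sequences.
records = [
    VirusRecord("ACC0001", "Homo sapiens", "Guinea", date(2014, 3, 17), "ACGT" * 5),
    VirusRecord("ACC0002", "Homo sapiens", "Guinea", date(2014, 6, 1), "ACGTNNNN" + "ACGT" * 3),
    VirusRecord("ACC0003", "Homo sapiens", "Gabon", date(1996, 2, 1), "ACGT" * 5),
    VirusRecord("ACC0004", "Homo sapiens", "Guinea", date(2014, 4, 2), "ACGT" * 2),
]

selected = [
    r.accession
    for r in records
    if passes_filters(
        r,
        host="Homo sapiens",
        countries={"Guinea", "Liberia", "Sierra Leone"},
        start=date(2014, 1, 1),
        end=date(2014, 12, 31),
        min_length=16,
        max_n=2,
    )
]
print(selected)
```

Only the first toy record survives: the second has too many N bases, the third is outside the region and date window, and the fourth is too short. Getting any one of these conditions wrong silently changes the final set, which is why such queries are hard to get right ad hoc.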
Results without a deterministic layer were mixed:
Models evaluated: Claude Sonnet 4, Claude Opus 4.7, Biomni, Edison Analysis, GPT-5.2-pro, GPT-5.5.
Mean observed accuracy: between 16.9% and 91.3%, depending on model and query.
Poor reproducibility: repeating the same query returned very different answers (e.g., Sonnet 4 returned 106, 15 and 5 sequences across three identical runs).
Concrete consequence: phylogenetic trees built from incomplete extractions shifted the estimated time to the most recent common ancestor (TMRCA) from January 2014 to implausible dates such as 1922 in some cases. That alters epidemiological hypotheses and public health decisions.
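The run-to-run variability described above can be quantified in a simple way. The sketch below uses mean pairwise Jaccard similarity between the sets of accessions returned by repeated runs; this is one common choice and not necessarily the metric used in VirBench. The accession sets are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average similarity over all pairs of runs (1.0 means identical results)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up accession sets from three runs of the same query.
run_1 = {"A1", "A2", "A3", "A4"}
run_2 = {"A1", "A2"}
run_3 = {"A1"}

score = mean_pairwise_jaccard([run_1, run_2, run_3])
print(f"mean pairwise Jaccard: {score:.3f}")
```

A score near 1.0 means repeated runs agree on which records they return; a low score flags exactly the kind of instability reported for unaided agents.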
The intervention: gget virus, a deterministic layer
To fix the fragility they built gget virus, a tool that replicates the filtering logic of the NCBI Virus web interface, but programmatically and reproducibly. This took more than wrapping a single API: NCBI Virus is a portal that aggregates REST, Datasets and E-utilities services along with internationally synchronized sources.
How it works, in practical terms:
It coordinates calls to REST, Datasets and E-utilities to reproduce the semantic filtering humans get in the browser.
It decides which filters can be delegated to APIs and which must be applied locally after downloading relevant records.
It handles batching and pagination to avoid arbitrary cuts in large collections like SARS-CoV-2 or Influenza A.
It reconciles identifiers and preserves relevant GenBank information in the final output.
It returns standardized, detailed outputs with logs that let you audit how the final set was produced.
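The steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the gget virus implementation: `SERVER_SIDE`, `LOCAL_CHECKS`, `fake_fetch_page` and the in-memory `DB` are all hypothetical stand-ins, and no real NCBI endpoint is called.

```python
# Hypothetical split: filters the (fake) remote API can apply itself.
SERVER_SIDE = {"taxon", "host"}

# Hypothetical filters that must be applied locally after download.
LOCAL_CHECKS = {
    "min_length": lambda rec, v: len(rec["seq"]) >= v,
    "max_n": lambda rec, v: rec["seq"].upper().count("N") <= v,
}


def split_filters(filters):
    """Partition filters into API-delegated and locally applied ones."""
    remote = {k: v for k, v in filters.items() if k in SERVER_SIDE}
    local = {k: v for k, v in filters.items() if k not in SERVER_SIDE}
    unknown = sorted(set(local) - set(LOCAL_CHECKS))
    if unknown:
        # Fail loudly instead of silently ignoring a filter.
        raise ValueError(f"unsupported filters: {unknown}")
    return remote, local


def fake_fetch_page(remote_filters, offset, page_size, db):
    """Stand-in for one paginated API call against an in-memory table."""
    matches = [r for r in db if all(r[k] == v for k, v in remote_filters.items())]
    return matches[offset:offset + page_size]


def fetch_all(remote_filters, db, page_size=2):
    """Page until an empty page comes back, so nothing is cut off arbitrarily."""
    out, offset = [], 0
    while True:
        page = fake_fetch_page(remote_filters, offset, page_size, db)
        if not page:
            return out
        out.extend(page)
        offset += len(page)


def retrieve(filters, db):
    """Run the whole pipeline and return sorted accessions plus an audit log."""
    remote, local = split_filters(filters)
    log = [f"delegated to API: {sorted(remote)}", f"applied locally: {sorted(local)}"]
    fetched = fetch_all(remote, db)
    log.append(f"fetched {len(fetched)} records")
    kept = [r for r in fetched if all(LOCAL_CHECKS[k](r, v) for k, v in local.items())]
    log.append(f"kept {len(kept)} after local filters")
    # Sorting makes identical inputs produce identical output.
    return sorted(r["accession"] for r in kept), log


# Invented records for illustration only.
DB = [
    {"accession": "X5", "taxon": "virus A", "host": "human", "seq": "ACGTACGT"},
    {"accession": "X2", "taxon": "virus A", "host": "human", "seq": "ACG"},
    {"accession": "X3", "taxon": "virus A", "host": "bat", "seq": "ACGTACGT"},
    {"accession": "X1", "taxon": "virus A", "host": "human", "seq": "ACGTAC"},
    {"accession": "X4", "taxon": "virus B", "host": "human", "seq": "ACGTACGT"},
]

accessions, log = retrieve({"taxon": "virus A", "host": "human", "min_length": 5}, DB)
print(accessions)
print("\n".join(log))
```

The design choice worth noting is that every decision (what was delegated, what was filtered locally, how many records were fetched and kept) ends up in the log, so a scientist can audit how the final set was produced.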
The result was striking: with gget virus available to agents, accuracy rose above 90% for all systems and reached 99.7% with GPT-5.5. Run-to-run variability practically vanished and differences between models shrank noticeably.
Technical lessons and practical recommendations
Determinism where it matters
The creative engines of models should coexist with deterministic layers for data retrieval, normalization and logging. That makes results reproducible, auditable and verifiable by scientists.
Design APIs with agents in mind
Expose programmatic filtering equivalent to the web interface, well-documented metadata, persistent identifiers and endpoints that support robust pagination.
Connectors, harnesses and tests
Implement SDKs and connectors (like gget virus) that encapsulate reconciliation and batching logic.
Add test suites and benchmarks (e.g. VirBench-like) to validate that a connector reproduces the expected semantics.
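A benchmark-style check for a connector can be quite small. The sketch below assumes each test case pairs a query with a curated set of expected accessions and scores a case as correct only on an exact match; the `score_case` and `run_suite` helpers, the toy connector and the cases are hypothetical and do not reproduce how VirBench is scored.

```python
def score_case(returned, expected):
    """Compare a connector's output with the curated answer for one query."""
    returned, expected = set(returned), set(expected)
    return {
        "exact": returned == expected,
        "missing": sorted(expected - returned),
        "unexpected": sorted(returned - expected),
    }


def run_suite(connector, cases):
    """Run every case and return (fraction of exact matches, per-case details)."""
    results = {
        name: score_case(connector(query), expected)
        for name, (query, expected) in cases.items()
    }
    accuracy = sum(r["exact"] for r in results.values()) / len(results)
    return accuracy, results


# Toy connector backed by a lookup table; a real one would call the data source.
CANNED = {"query one": ["A1", "A2"], "query two": ["B1"]}


def toy_connector(query):
    return CANNED.get(query, [])


# Invented cases: name -> (query, expected accessions).
cases = {
    "case_ok": ("query one", {"A1", "A2"}),
    "case_incomplete": ("query two", {"B1", "B2"}),
}

accuracy, results = run_suite(toy_connector, cases)
print(accuracy)
print(results["case_incomplete"])
```

Reporting the missing and unexpected accessions, and not just a pass or fail, makes it easier to see whether a connector drops records or lets the wrong ones through.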
Record and version
Provenance metadata, filtering logs and API versions should accompany every extraction to allow audit and reproduction.
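One way to picture such a record is a small manifest written next to every extraction. The field names, version strings and query below are assumptions made for illustration, not a format defined by gget virus or NCBI.

```python
import hashlib
import json


def build_manifest(query, accessions, tool_version, api_versions, retrieved_at):
    """Build a provenance record; the digest is independent of input order."""
    ordered = sorted(accessions)
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "query": query,
        "tool_version": tool_version,
        "api_versions": api_versions,
        "retrieved_at": retrieved_at,
        "record_count": len(ordered),
        "accessions_sha256": digest,
    }


# All values here are placeholders.
manifest = build_manifest(
    query={"taxon": "virus A", "host": "human", "min_length": 5},
    accessions=["X5", "X1", "X3"],
    tool_version="0.0.0-example",
    api_versions={"example_api": "v1"},
    retrieved_at="2025-01-01T00:00:00Z",
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the accessions are sorted before hashing, two extractions that return the same records in a different order produce the same digest, which makes it cheap to confirm that a rerun reproduced the original set.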
Cost, trust and maintenance
Even as models improve, connectors remain valuable: they’re cheaper, faster and easier to audit than retraining models or relying on ad hoc reasoning each run.
Implications for science and public health
This isn’t just an academic discussion. In real outbreaks, like the Bundibugyo event in the Democratic Republic of the Congo mentioned in the report, being able to automate the retrieval of historical genomes correctly can speed up diagnostics, help validate therapies and clarify the origin of the event.
Also, it democratizes access: with deterministic layers, you don’t need the most expensive model to get correct data. Researchers in lower-resource settings can run reproducible workflows without depending on the latest frontier models.
Final reflection
Anthropic shows the bottleneck isn’t only model reasoning ability: it’s the lack of deterministic, machine-oriented infrastructure. If we want agents to help with discovery, outbreak response and drug design, we need to build paved streets for them: coherent APIs, auditable connectors and standardized metadata.
The good news? Some of that infrastructure already exists in bioinformatics libraries and tools, and gget virus is a practical example of how to piece things together. The task? Scale those approaches, standardize interfaces and start treating agents as primary users from now on.