DR Tulu launches open recipe for deep research
DR Tulu is an open effort to get models to do deep research: plan, search and synthesize information from many sources to produce long, well-supported answers with clear citations. Sounds complex? It is. Is it useful today? Also yes.
What DR Tulu is and why it matters
DR Tulu is the first open model trained specifically for long-form research tasks using an end-to-end recipe that combines SFT (supervised fine-tuning) and a new variant of RL they call RLER (Reinforcement Learning with Evolving Rubrics). The main idea is to train agents that don’t just answer, but investigate: plan, call search tools, gather evidence and document each claim with verifiable citations.
Why is this relevant now? Because many powerful research agents are proprietary. DR Tulu proposes a reproducible alternative: model, code, agent library and the full recipe under a permissive license.
Practically speaking: DR Tulu-8B ties with or outperforms several proprietary agents on long-form research benchmarks, at a per-query cost thousands of times lower.
How it works: agent, tools and MCP
At inference time, DR Tulu runs an autonomous search loop and chooses among three actions: think to plan, call_tool to invoke search or browsing tools, and answer to produce the final response.
Final answers include citation tags that link to the sources used. That makes it easier to audit the agent’s steps and verify that claims are actually grounded.
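To make the loop concrete, here is a minimal sketch of that action cycle. The action names (think, call_tool, answer) come from the post; everything else (function names, the toy policy) is a stand-in, not the real DR Tulu code:

```python
# Hypothetical sketch of the think / call_tool / answer loop.
# `generate_action` and `run_tool` are placeholders, not the real DR Tulu API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "think", "call_tool" or "answer"
    payload: str = ""  # plan text, tool query or final answer
    tool: str = ""     # e.g. "google_search", "web_browse", "paper_search"

def generate_action(question: str, history: list) -> Action:
    """Placeholder for the policy model deciding the next step."""
    if not history:
        return Action("think", payload=f"Break down: {question}")
    if len(history) < 3:
        return Action("call_tool", tool="google_search", payload=question)
    return Action("answer", payload="Summary of findings <cite id=1>...</cite>")

def run_tool(action: Action) -> str:
    """Placeholder for an MCP tool call; would hit a real searcher or browser."""
    return f"[snippets for '{action.payload}' from {action.tool}]"

def research(question: str) -> str:
    history = []
    while True:
        action = generate_action(question, history)
        if action.kind == "think":
            history.append(("think", action.payload))
        elif action.kind == "call_tool":
            history.append(("evidence", run_tool(action)))
        else:  # "answer": final response with citation tags
            return action.payload

print(research("What is known about BRCA1 variants of uncertain significance?"))
```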
To handle different domains, DR Tulu uses the Model Context Protocol or MCP, which treats tools as interchangeable components. In the default setup it offers:
google_search for web snippets
web_browse to extract full text from pages
paper_search for relevant paragraphs from open-access papers
Thanks to dr-agent-lib you can swap or add searchers, private databases or domain-specific readers without retraining the model.
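For example, exposing a private search index is just another MCP server. A minimal sketch, assuming the MCP Python SDK's FastMCP helper; the tool name and retrieval logic are placeholders, and how dr-agent-lib registers the new tool is not shown here:

```python
# Minimal MCP tool server sketch (assumes the `mcp` Python SDK's FastMCP helper).
# The tool body is a placeholder; point it at your own database or search index.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs-search")

@mcp.tool()
def internal_docs_search(query: str, top_k: int = 5) -> list[str]:
    """Return the top_k passages from a private document store (placeholder)."""
    # Replace with a real retriever, e.g. a vector store or full-text search.
    return [f"passage {i} matching '{query}'" for i in range(top_k)]

if __name__ == "__main__":
    mcp.run()  # exposes the tool over stdio so an MCP client can call it
```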
The training recipe: SFT + RLER
Training for long-form research has two key challenges: there isn’t a single “correct” answer, and static rubrics break when the model learns to exploit their weaknesses. DR Tulu solves this in two phases.
SFT for the cold start
Before RL, they apply SFT with trajectories generated by GPT-5. Those trajectories include simulated chain-of-thought steps, tool calls and final responses with formatting and citations.
The goal is for the model to learn the protocol: when to call tools, how to structure an answer and how to cite. That warm start avoids the poor exploration and ineffective tool calls that show up when RLER is applied from scratch.
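The post doesn't fix a trajectory schema, but a single SFT example could look roughly like this (field names are purely illustrative):

```python
# Hypothetical shape of one SFT trajectory (field names are illustrative only).
sft_example = {
    "question": "What evidence links gene X to disease Y?",
    "steps": [
        {"action": "think", "content": "Need recent reviews plus primary studies."},
        {"action": "call_tool", "tool": "paper_search",
         "args": {"query": "gene X disease Y association"}},
        {"action": "think", "content": "Two cohort studies found; check effect sizes."},
    ],
    "answer": "Current evidence suggests ... <cite id=\"s1\">Smith 2022</cite> ...",
    "citations": [{"id": "s1", "source": "paper_search", "title": "Smith 2022"}],
}
```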
RLER: rewards that evolve
RLER adapts the reward function during training along three axes; a toy reward sketch follows the list:
Instance-specific rubrics anchored in search. For each question a rubric is generated based on retrieved context.
Positive and negative rubrics that evolve. New good strategies are promoted and failure or hacking modes are penalized (for example pasting retrieved text without synthesizing).
Dynamic buffer of rubrics and auxiliary rewards for format and faithful citation.
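Here is the toy reward sketch mentioned above: it scores a response against positive and negative rubrics and adds an auxiliary citation-format bonus. The judge, rubric texts and weights are placeholders, not the paper's actual implementation:

```python
# Hypothetical RLER-style reward: score a response against an evolving rubric buffer
# plus an auxiliary format/citation check. Weights and judge are illustrative only.
import re

def judge(response: str, rubric: str) -> float:
    """Placeholder LLM judge: 1.0 if the rubric is satisfied, else 0.0."""
    return 1.0 if rubric.lower().split()[0] in response.lower() else 0.0

def rler_reward(response: str, pos_rubrics: list[str], neg_rubrics: list[str]) -> float:
    pos = sum(judge(response, r) for r in pos_rubrics) / max(len(pos_rubrics), 1)
    neg = sum(judge(response, r) for r in neg_rubrics) / max(len(neg_rubrics), 1)
    has_citations = 1.0 if re.search(r"<cite[^>]*>", response) else 0.0
    # Reward satisfied positive rubrics, penalize hacking patterns, add a format bonus.
    return pos - neg + 0.2 * has_citations

pos_rubrics = ["covers the main cohort studies", "compares effect sizes"]
neg_rubrics = ["verbatim pasted snippets without synthesis"]
print(rler_reward("Covers the cohort studies... <cite id='s1'/>", pos_rubrics, neg_rubrics))
```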
They also use a GRPO variant that supports multiple rollouts per prompt and asynchronous training. Execution lets generation and search overlap: when one trajectory makes a tool call it pauses while the others keep generating, making better use of API calls.
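That overlap is easy to picture with plain asyncio: while one trajectory awaits its tool call, the rest keep advancing. A toy scheduler, not the actual trainer:

```python
# Toy illustration of overlapping rollouts: while one trajectory awaits a tool call,
# the event loop keeps the other trajectories progressing. Not the real trainer.
import asyncio, random

async def fake_tool_call(query: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))   # simulated search latency
    return f"evidence for '{query}'"

async def rollout(idx: int) -> str:
    steps = []
    for step in range(3):
        steps.append(f"think[{idx}.{step}]")          # cheap local generation
        steps.append(await fake_tool_call(f"q{idx}"))  # pauses only this trajectory
    return " -> ".join(steps)

async def main():
    # Multiple rollouts per prompt, gathered concurrently (GRPO-style grouping).
    results = await asyncio.gather(*(rollout(i) for i in range(4)))
    for r in results:
        print(r)

asyncio.run(main())
```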
Results on benchmarks and efficiency
They evaluated DR Tulu-8B (RL) on seven benchmarks (four long-form synthesis and three short-answer) and the results are solid:
ScholarQA-CSv2: 86.7 (DR Tulu-8B) versus 42.5 for WebExplorer-8B and 32.9 for WebThinker-32B-DPO.
ResearchQA and DeepResearch Bench: 71.1 and 41.8 respectively, with clear improvements over open baselines.
On ScholarQA-CSv2 DR Tulu reaches a citation precision of 90.6 and recall of 76.1, which points not only to complete answers but also to better grounding in the literature.
And the cost? A typical evaluated query spends about $0.00008 on external calls. Even with a maximum of 10 searches, the approximate cap was $0.0075 per query. By comparison, proprietary agents can reach $1 or more per query on these tasks.
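Putting only the quoted figures side by side:

```python
# Rough cost comparison using the figures quoted above.
typical_cost = 0.00008   # USD in external calls for a typical query
capped_cost = 0.0075     # USD cap with at most 10 searches
proprietary = 1.00       # USD or more per query for proprietary agents

print(f"typical query is ~{proprietary / typical_cost:,.0f}x cheaper than $1/query")
print(f"even at the cap, ~{proprietary / capped_cost:,.0f}x cheaper")
```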
Clinical case: genetics and current limits
Tested on a realistic question set about genetic variants (GeneticDiseasesQA), DR Tulu beats several open alternatives and some proprietary services in evidence synthesis and answer quality. On evidence synthesis it even competes well with GPT-5 + Search.
However, HealthBench, with its demanding medical queries, remains a challenge. DR Tulu improves on other open options, but it still falls short of expert-level clinical advice.
Practical design: reproducibility and use
They release everything: the DR Tulu-8B checkpoint, the training code, the RLER pipeline and dr-agent-lib. This lets you:
Reproduce the experiments and study how rewards and tools change behavior.
Deploy the agent with your own MCP tools and privacy policies.
Extend the library to specific domains without retraining, by describing new tools to the agent.
If you want to try it in your domain, a typical route is: adjust the MCP tool stack, curate representative questions, and evaluate with specific rubrics before running RL.
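As a starting point, that route could be captured in a small config; every name and field here is hypothetical:

```python
# Hypothetical domain-adaptation checklist expressed as a config dict.
domain_setup = {
    "mcp_tools": ["paper_search", "internal_docs_search"],  # adjust the tool stack
    "eval_questions": "data/genetics_questions.jsonl",      # curated, representative
    "rubrics": {
        "coverage": "mentions all clinically relevant variants",
        "grounding": "every claim carries a citation tag",
    },
    "run_rl": False,  # evaluate with the rubrics first; enable RLER only afterwards
}
```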
Final reflection
DR Tulu shows it’s possible to train open agents for long-form research with a good balance of quality, cost and traceability. The key is combining an SFT-guided warm start with rewards that evolve according to what the agent actually discovers.
Does this mean AI-assisted research solves everything now? No. There are still challenges in clinical domains and in ensuring ethics and robustness. But you now have an open recipe to experiment and improve, not just a proprietary black box.