Datadog uses Codex to review code and prevent incidents
Datadog integrates Codex into its code reviews to look beyond the diff and prevent failures in distributed systems. Why does this matter to you, whether you work at a startup or on a critical platform? Because code review is shifting from being a one-off filter to being a proactive reliability system.
How Codex made it into Datadog's reviews
Datadog's AI DevX team experimented with Codex by plugging it directly into the workflow: every pull request in one of their largest repositories received an automatic review. Engineers responded with thumbs up or down and left informal feedback in Slack.
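To make the integration concrete, here is a minimal sketch of what an automated per-PR review hook could look like. Datadog hasn't published its implementation, so the `codex exec` invocation, the environment variables, and the GitHub comment call below are illustrative assumptions, not their actual setup:

```python
# Hypothetical CI hook: runs once per pull request and posts an agent review.
import os
import subprocess

import requests

GITHUB_API = "https://api.github.com"

def review_pull_request(repo: str, pr_number: int, token: str) -> None:
    # Ask the agent for a system-level review of the checked-out branch.
    # Assumes the Codex CLI is on PATH and supports non-interactive `codex exec`.
    prompt = (
        "Review the changes on this branch against the base branch. "
        "Look beyond the diff: flag risky interactions with other modules, "
        "missing test coverage at service boundaries, and API contract changes."
    )
    result = subprocess.run(
        ["codex", "exec", prompt],
        capture_output=True, text=True, check=True,
    )
    # Post the findings back to the pull request as a regular comment.
    requests.post(
        f"{GITHUB_API}/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {token}"},
        json={"body": f"Codex review:\n\n{result.stdout}"},
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    review_pull_request(
        repo=os.environ["GITHUB_REPOSITORY"],
        pr_number=int(os.environ["PR_NUMBER"]),
        token=os.environ["GITHUB_TOKEN"],
    )
```

The thumbs up/down signal can then be collected from reactions on that comment and piped into Slack, which is roughly the feedback loop described here.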
What made it different from previous tools? Many engineers said Codex produced comments worth reading, unlike the noisy or superficial suggestions they were used to.
What Codex did differently
Datadog found that Codex didn’t stop at style issues. Instead, it provided system-level reasoning:
It pointed out interactions with modules that weren't even in the diff.
It noticed missing test coverage at service coupling points.
It identified changes in API contracts that risked cascading effects.
For engineers, a Codex comment felt like "the smartest engineer with infinite time to find bugs." It sees connections a person can't hold in their head at a single glance.
Unlike static analyzers, Codex compares the intent of the pull request with the code that was actually submitted and reasons about the codebase and its dependencies, even running tests to validate behavior.
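The distinction is easy to see in code. A linter only ever sees the text of the change; an agent can be handed the author's stated intent, the diff, and surrounding modules the diff doesn't touch. A rough sketch using the OpenAI Python SDK, where the model name, prompts, and context-gathering are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review(pr_description: str, diff: str, related_files: dict[str, str]) -> str:
    # related_files: modules, callers, and contracts outside the diff that the
    # caller has gathered as extra context (the gathering itself is assumed).
    context = "\n\n".join(
        f"--- {path} ---\n{src}" for path, src in related_files.items()
    )
    response = client.chat.completions.create(
        model="gpt-4.1",  # illustrative; the article does not name a model
        messages=[
            {"role": "system", "content": (
                "You are a code reviewer. Compare the stated intent with the "
                "actual change, and flag system-level risks: cross-module "
                "interactions, missing tests at coupling points, and API "
                "contract breaks."
            )},
            {"role": "user", "content": (
                f"Intent:\n{pr_description}\n\nDiff:\n{diff}\n\n"
                f"Related code not in the diff:\n{context}"
            )},
        ],
    )
    return response.choices[0].message.content
```

Note what the static analyzer never gets: the intent. Mismatches between what a PR says it does and what the code actually does are exactly the class of problem deterministic rules can't express.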
The test that convinced them: the incident replay
Datadog didn't stop at opinions. They built a replay harness using real incidents: they reconstructed pull requests that had contributed to past failures and ran Codex as if it were the original review.
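Conceptually, such a harness is a loop over historical incidents: restore the repository to the state the original reviewer saw, re-run the review on the culprit pull request, and grade whether the feedback would have flagged the failure. Datadog hasn't published its harness, so the structure below is a minimal sketch under assumptions: the reviewer is exposed as a callable, and grading is a naive keyword match.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class Incident:
    id: str
    culprit_pr: int   # pull request that contributed to the failure
    base_sha: str     # repository state when that PR was opened
    root_cause: str   # postmortem keywords describing what actually broke

def replay(incidents: list[Incident], review_fn: Callable[[int], str]) -> float:
    """Fraction of incidents where the re-run review flags the root cause."""
    caught = 0
    for incident in incidents:
        # Restore the repository to what the original reviewer saw.
        subprocess.run(["git", "checkout", incident.base_sha], check=True)
        review = review_fn(incident.culprit_pr)
        # Naive grader: did the review mention the failure mode at all?
        if incident.root_cause.lower() in review.lower():
            caught += 1
    return caught / len(incidents)
```

The hard design choice is the grader: keyword matching is crude, and a production harness would more likely rely on human judgment, or a second model, to decide whether a comment truly "would have made a difference."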
Concrete outcome: Codex found more than 10 cases (roughly 22% of the incidents analyzed) where its feedback would have made a difference. In other words, it detected risks that human reviews and traditional tools missed at the time.
That confirms something important: Codex complements human review; it doesn't replace it. It surfaces problems that aren't obvious from the immediate diff and that don't fit deterministic rules.
Impact on workflow and culture
After evaluation, Datadog rolled out Codex at larger scale. Today more than 1,000 engineers use it regularly. The impact goes beyond time savings:
Surfacing risks a single reviewer can't keep in context.
Detecting interactions between modules and services that go unnoticed.
Greater confidence when deploying at scale.
Freeing human reviewers to focus on architecture and design.
Teams post useful examples in Slack and share moments when Codex changed how they thought about a problem. That has changed the definition of code review: it's no longer just a checkpoint for bugs; it's part of the reliability system.
Final reflection
If you work with distributed systems, this isn't a fad. Integrating coding agents that reason at the system level can turn review into a preventive tool instead of a reactive one. Does this mean AI will replace senior engineers? No. It means AI can amplify their reach, surface risks that previously went unnoticed, and help build more reliable software.