How OpenAI scales PostgreSQL for 800M users

PostgreSQL, a database that can sound like a museum piece, sits at the heart of ChatGPT and OpenAI's API. Surprised? I was too. OpenAI supports hundreds of millions of users with a single primary and nearly 50 read replicas, and not by accident: it took targeted optimizations, rigorous engineering and hard practical decisions.

What happened and why it matters

In one year the load on PostgreSQL grew more than 10x. OpenAI needed to sustain millions of queries per second for 800 million users. The main strategy: keep a single-node primary for writes and offload reads to replicas. Why not split everything from the start? Because sharding PostgreSQL for existing applications means changing hundreds of endpoints and can take months or years.
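To make the single-primary pattern concrete, here is a minimal sketch of read/write routing. It is not OpenAI's implementation: the `ReadWriteRouter` class, the host names and the `needs_fresh_data` flag are all hypothetical, and the "starts with SELECT" check is a deliberate simplification.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to one primary and spreads reads across replicas (illustrative only)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if none are configured.
        self._read_targets = itertools.cycle(replicas or [primary])

    def target_for(self, sql, needs_fresh_data=False):
        # Simplification: only statements starting with SELECT count as reads.
        # Anything else (INSERT, UPDATE, CTEs, ...) is routed to the primary.
        # A real router would also need to catch cases like SELECT ... FOR UPDATE.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and not needs_fresh_data:
            return next(self._read_targets)
        # Writes, and reads that cannot tolerate replica lag, go to the primary.
        return self.primary


# Hypothetical host names, for illustration only.
router = ReadWriteRouter("primary:5432", ["replica-1:5432", "replica-2:5432"])
read_host = router.target_for("SELECT * FROM conversations WHERE id = 1")
write_host = router.target_for("UPDATE conversations SET title = 'x' WHERE id = 1")
```

The `needs_fresh_data` escape hatch is there because replicas can lag behind the primary, so a read that must see a write made moments earlier may still need to hit the primary.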

But working with a single primary brings risks: write bursts, heavy queries and cache failures can saturate the primary and trigger retry cycles that make the overload worse. The story shows that with careful engineering and prudence, PostgreSQL can scale much further than many thought, especially for read-dominated workloads.
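The source does not say how OpenAI bounds its retries. One common way to keep retry cycles from amplifying an overload is capped exponential backoff with jitter, sketched below. The function name `call_with_backoff` and every default value are assumptions chosen for illustration.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.05, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Give up and re-raise: unbounded retries only add load to a saturated primary.
                raise
            # Full jitter: wait a random fraction of min(max_delay, base_delay * 2**attempt),
            # so clients that failed together do not all retry at the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be checked deterministically. In real code you would also catch a narrower exception type than `Exception`, so that only transient failures are retried.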

Main challenges and the solutions they applied

Results and numbers that matter

Practical lessons you can apply if you run critical systems

What's next for OpenAI and for the technology in general
