PostgreSQL, that database system that sometimes sounds like a museum piece, is at the heart of ChatGPT and OpenAI's API. Surprised? I was too. OpenAI managed to support hundreds of millions of users with a single primary and nearly 50 read replicas, but it wasn't by accident: there were optimizations, rigorous engineering and hard practical decisions.
What happened and why it matters
In one year the load on PostgreSQL grew more than 10x. OpenAI needed to sustain millions of queries per second for 800 million users. The main strategy: keep a single-node primary for writes and offload reads to replicas. Why not split everything from the start? Because sharding PostgreSQL for existing applications means changing hundreds of endpoints and can take months or years.
But working with a single primary brings risks: write bursts, heavy queries and cache failures can saturate the primary and trigger retry cycles that make things worse. The story shows that with engineering and prudence, PostgreSQL can scale much farther than many thought, especially for read-dominated workloads.
Main challenges and the solutions they applied
-
Problem: intensive writes and
MVCCcausing write amplification and bloat.- Solution: they moved write-heavy, shardable workloads to sharded systems like Azure CosmosDB. For new workloads, the default rule is to use sharded systems. They also fixed bugs in the application that generated redundant writes and applied lazy writes and rate limits on backfills.
-
Problem: expensive queries that saturated CPU, for example joins across many tables.
- Solution: constant SQL optimization, avoiding multi-table joins when possible and moving complex logic to the application side. Reviewing what ORMs generate is key. They also applied timeouts like
idle_in_transaction_session_timeoutto avoid idle queries that block autovacuum.
- Solution: constant SQL optimization, avoiding multi-table joins when possible and moving complex logic to the application side. Reviewing what ORMs generate is key. They also applied timeouts like
-
Problem: single point of failure at the primary.
- Solution: move most reads to replicas; the primary runs in HA mode with a hot standby ready for quick promotion. That way, if the primary falls, critical reads keep being served and impact is reduced.
-
Problem: noisy neighbors consuming shared resources.
- Solution: isolate workloads on dedicated instances and segment by priority (high and low). This way, one feature that generates load doesn't degrade other functionality.
-
Problem: connection storms (limit of 5,000 on Azure PostgreSQL).
- Solution: they deployed
PgBouncerfor connection pooling. Practical result: average connect time dropped from 50 ms to 5 ms in their benchmarks. They also co-located proxies, clients and replicas by region to reduce latency and inter-region connection usage.
- Solution: they deployed
-
Problem: waves of cache misses that push many reads to the database.
- Solution: locking and leasing at the cache layer. Only one reader repopulates the cache on a miss while others wait. That prevents thousands of requests from hammering PostgreSQL at once.
-
Problem: WAL replication to many replicas puts pressure on the primary.
- Solution: try cascaded replication where intermediate replicas retransmit WAL to other replicas. This lets you scale replicas without overloading the primary, though it adds operational complexity and failover challenges.
-
Problem: sudden spikes that exhaust CPU, I/O or connections.
- Solution: rate limiting at multiple layers (application, pooler, proxy, query). Also block problematic query digests coming from the ORM when needed to relieve pressure quickly.
-
Problem: costly schema changes that rewrite entire tables.
- Solution: allow only lightweight changes, apply a strict 5-second timeout for schema operations, and move new tables to sharded systems. Backfills are done with rate limits and can take weeks, but they avoid production impact.
In short: they mitigate each threat with controls at multiple layers. There is no silver bullet: there are many small bullets well aimed.
Results and numbers that matter
- PostgreSQL load grew 10x in a year, but with optimizations they kept latencies low.
- They support millions of QPS for read-dominated workloads, with nearly 50 replicas distributed globally.
- Client p99 latency in the low single-digit milliseconds and five nines availability in production.
- Only one SEV-0 PostgreSQL incident in the last 12 months, during the viral launch of ChatGPT ImageGen when writes rose more than 10x and 100 million users signed up in a week.
- With
PgBouncer, average connection time fell from 50 ms to 5 ms in their tests.
Practical lessons you can apply if you run critical systems
-
Observability and limits: instrument query metrics and apply limits at multiple layers to cut a spike before it floods the database.
-
Review what ORMs generate: many surprises come from automatic SQL. Sometimes it's better to write the query by hand or split it.
-
Cache with control: using locking or leasing in the cache prevents read cascades when the cache fails.
-
Isolate by priority: separating critical from non-critical traffic reduces risk when you roll out new features.
-
Avoid aggressive schema changes in production: plan backfills and operations that can be applied in small steps.
What's next for OpenAI and for the technology in general
OpenAI continues migrating the easier-to-shard write loads to CosmosDB and is testing cascaded replication to avoid overloading the primary. Later they won't rule out sharding PostgreSQL or exploring alternative distributed systems if demand requires it.
If you have an architecture that today relies heavily on a relational database, the conclusion is clear: with design, discipline and the right tools you can stretch PostgreSQL a long way, but you also need escape routes to write when the load pattern demands it.
