A short while ago the Mistral team spotted a memory leak that only showed up in a very specific scenario: vLLM with Prefill/Decode disaggregation, using the Mistral Medium 3.1 model and graph compilation enabled. There were no errors or crashes, just a linear rise in memory of about 400 MB per minute until the process ran out of memory.
Does that sound scary? Yes. Impossible to investigate? Not at all. This story walks you through how they got to the root of the issue: from Python-level profiling to kernel-level traces, and how a low-level dependency (UCX) turned out to be the culprit.
What happened, and why did it matter so much?
The pattern was odd: the leak showed up only on the decode side of the Prefill/Decode split, and only when the KVCache transfer path went through NIXL. That pointed to the memory-transfer route as the origin.
Prefill/Decode works like this, at a high level (a toy sketch follows the list):
The router sends a prefill request to a prefill worker, which computes the KVCache.
That KVCache is then transferred to a decode worker, which generates tokens by extending it.
The KVCache transfer happens via NIXL, which relies on UCX for fast communication (RDMA, InfiniBand, etc.).
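To make that flow concrete, here is a purely illustrative Python sketch. None of these classes or functions are real vLLM or NIXL APIs; they only mark where the KVCache crosses a process boundary, which is the path that mattered in this investigation.

```python
# Toy sketch of the Prefill/Decode split. All names are hypothetical,
# not vLLM/NIXL APIs.

class PrefillWorker:
    def prefill(self, prompt: str) -> list[float]:
        # In reality: run the model over the prompt and produce its KVCache.
        return [0.0] * len(prompt)

class DecodeWorker:
    def decode(self, kv_cache: list[float]) -> str:
        # In reality: extend the received KVCache token by token.
        return f"decoded from a KVCache of {len(kv_cache)} entries"

def transfer_via_nixl(kv_cache: list[float]) -> list[float]:
    # In reality: NIXL moves this buffer between processes over UCX
    # (RDMA/InfiniBand). This is the path where the leak lived.
    return kv_cache

kv = PrefillWorker().prefill("hello")   # 1. router asks a prefill worker for the KVCache
kv = transfer_via_nixl(kv)              # 2. the KVCache is shipped to the decode side
print(DecodeWorker().decode(kv))        # 3. the decode worker generates tokens from it
```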
If this sounds technical, the practical lesson is simple: when you move large blocks of memory between processes and use high-performance libraries, there are more places where things can go wrong.
How they investigated it (without getting lost in jargon)
They started with Python profiling tools, Memray and Guppy 3, and then moved to Heaptrack, a native heap profiler. Curiously, Heaptrack showed the heap was stable, but the RSS (resident set size) kept growing. How is that possible? Because RSS includes more than the heap: it also covers anonymous mmap regions, huge pages, and memory managed outside glibc's allocator.
To watch the mappings in real time they ran pmap repeatedly (with watch) and noticed certain anonymous regions growing and changing addresses: clear signs of mmap/mremap activity, or of munmap + mmap cycles that never actually returned memory to the OS.
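For reference, the same observation can be reproduced with a few lines of Python that read /proc directly (pmap is essentially a formatter for /proc/&lt;pid&gt;/maps). This is a minimal sketch of the idea, not the exact tooling the team used:

```python
# Poll a process's RSS and its anonymous mappings from /proc, to spot growth
# that heap profilers (which only see glibc malloc/free) will miss.
import sys
import time

def rss_kb(pid: int) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value reported in kB
    return 0

def anon_regions(pid: int) -> list[tuple[str, int]]:
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 5:  # no pathname field => anonymous mapping
                start, end = (int(x, 16) for x in parts[0].split("-"))
                regions.append((parts[0], end - start))
    return regions

pid = int(sys.argv[1])  # PID of the worker to watch
while True:
    regions = anon_regions(pid)
    total_kb = sum(size for _, size in regions) // 1024
    print(f"RSS={rss_kb(pid)} kB, anonymous regions={len(regions)}, "
          f"anonymous total={total_kb} kB")
    time.sleep(5)
```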
Heaptrack only catches glibc malloc/free, so they needed to go one level deeper.
BPFtrace and why it was useful
With a BPFtrace script they traced mmap, munmap and mremap at the syscall level. That trace showed the calls came from syscall+29, meaning from a wrapper issuing raw syscalls directly into the kernel. That indicated the library making the calls was bypassing the usual glibc routes, and with them the hooks the profilers rely on.
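The original script was written in bpftrace; an equivalent sketch using bcc's Python bindings (assuming bcc is installed and the script runs as root) would look roughly like this:

```python
# Log mmap, munmap and mremap syscalls from a given PID at the kernel level,
# similar in spirit to the bpftrace script used in the investigation.
# Requires the bcc Python bindings and root privileges.
import sys
from bcc import BPF

target_pid = int(sys.argv[1])  # PID of the decode worker

prog = """
int trace_mmap(void *ctx)   { bpf_trace_printk("mmap\\n");   return 0; }
int trace_munmap(void *ctx) { bpf_trace_printk("munmap\\n"); return 0; }
int trace_mremap(void *ctx) { bpf_trace_printk("mremap\\n"); return 0; }
"""

b = BPF(text=prog)
for syscall, fn in (("mmap", "trace_mmap"),
                    ("munmap", "trace_munmap"),
                    ("mremap", "trace_mremap")):
    b.attach_kprobe(event=b.get_syscall_fnname(syscall), fn_name=fn)

while True:
    task, pid, cpu, flags, ts, msg = b.trace_fields()
    if pid == target_pid:  # only report the worker we care about
        print(f"{ts:.6f} {msg.decode()}")
```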
Still, BPFtrace didn't provide the full user stack needed to identify the real caller, only the trace up to the syscall instruction.
GDB with conditional breakpoints: the final trick
As a pragmatic solution they set a very targeted breakpoint on the syscall instruction that only triggered when the syscall number was SYS_mmap. By capturing mmap's return value and the full stack at each hit, they could correlate the returned addresses with the growing regions seen in pmap.
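Here is a sketch of that idea using GDB's Python API (to be sourced from inside gdb after attaching to the decode worker). It breaks on libc's syscall() wrapper and assumes the x86_64 calling convention; it illustrates the technique rather than reproducing the exact commands the team ran:

```python
# GDB Python sketch: react only when the raw syscall being issued is mmap,
# then dump the full user stack. Assumes x86_64 and that the calls go through
# libc's syscall() wrapper (as the syscall+29 return address suggested).
import gdb

SYS_MMAP = 9  # x86_64 syscall number for mmap

class MmapBreakpoint(gdb.Breakpoint):
    def stop(self):
        # At the entry of syscall(long number, ...) the syscall number is the
        # first C argument, i.e. in $rdi on x86_64.
        if int(gdb.parse_and_eval("$rdi")) != SYS_MMAP:
            return False      # not mmap: don't stop the process
        gdb.execute("bt")     # log the full user stack for this mmap
        return False          # log and continue instead of halting the worker

MmapBreakpoint("syscall")
```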
That full trace pointed to the culprit: UCX (via UCM/ucs) was calling mmap from its internal hooks. At times munmap inside UCX triggered operations that in turn caused more mmap. In short: UCX’s internal memory management (its registration cache, or RCache) accumulated regions and didn’t release them immediately.
Why was UCX doing this?
UCX optimizes RDMA transfers by registering (pinning) memory pages so the NIC can access them without CPU intervention. To do that, UCX patches GOT entries at runtime and adds hooks to mmap/munmap to control and speed up memory registration. It’s powerful, but it breaks assumptions of debugging tools and can interfere with simple hooks.
Also, UCX doesn’t free regions immediately: it places them on an invalidation queue managed by its memory pool. If that queue grows without bounds (default value inf), the process keeps requesting more mmap and never recovers RSS.
How they fixed it (immediate and deeper fixes)
The good news: there were clear, safe fixes for this particular use case.
Quick fix: disable UCX’s mmap hook with the environment variable UCX_MEM_MMAP_HOOK_MODE=none. This removed the leak without hurting performance in the vLLM/NIXL flow, because that flow only needs to register one large contiguous region (the KVCache) once.
Another useful measure: limit the queue of unreleased regions with UCX_RCACHE_MAX_UNRELEASED=1024 (the default was infinite). This forces UCX to perform periodic cleanups and prevents unbounded accumulation.
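If the decode worker is launched from Python, one way to make sure both settings are in place is to export them into the environment before UCX initializes. A minimal sketch follows; in practice you would more likely set these in the shell or the deployment manifest that starts the worker:

```python
# Apply both UCX workarounds before UCX is loaded. UCX reads these variables
# when it initializes, so they must be set before NIXL (and therefore UCX)
# is imported/loaded in the decode worker's process.
import os

os.environ.setdefault("UCX_MEM_MMAP_HOOK_MODE", "none")     # don't patch mmap/munmap
os.environ.setdefault("UCX_RCACHE_MAX_UNRELEASED", "1024")  # bound the invalidation queue

# ...only now import/start the components that pull in NIXL/UCX.
```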
In the medium term, vLLM and the maintainers of NIXL/UCX agreed to adjust defaults and change behaviors to reduce the chance of this pattern recurring.
Practical lessons for teams running inference at scale
Don’t rely only on high-level profilers: Heaptrack, Memray or Guppy are helpful, but you must look at RSS, pmap and syscalls when the growing memory isn’t on the heap.
High-performance libraries (UCX, RDMA stacks, GPU managers) can patch functions at runtime. That’s great for speed, but makes debugging harder. How do you handle that? Keep kernel-level traces (BPFtrace/strace) and have strategies to get full stacks when BPFtrace falls short (targeted GDB use).
If you see mmap-backed growth in RSS, think about registration caches, memory pools, and internal queues inside your libraries: the issue might live outside your main Python or C code.
Collaborate with library maintainers. In this case, investigation and coordination with vLLM, NIXL and UCX were key to producing patches and better defaults.
Final reflection
This investigation is a reminder that modern inference infrastructure is a layered stack: every dependency brings optimizations that help—and sometimes introduce new failure modes. Taking the time to go down to the kernel, correlate pmap + BPFtrace + GDB, and understand a library’s intent is what turns a temporary workaround into a robust fix.
If you manage deployments of large, disaggregated models, keep these tools and patterns handy: they can save you days of frustration next time.