A short while ago the Mistral team spotted a memory leak that only showed up in a very specific scenario: vLLM with Prefill/Decode disaggregation, using the Mistral Medium 3.1 model and graph compilation enabled. There were no errors or crashes, just a linear rise in memory of about 400 MB per minute until the process ran out of memory.
Does that sound scary? Yes. Impossible to investigate? Not at all. This story walks you through how they got to the root of the issue: from Python-level profiling to kernel-level traces, and how a low-level dependency (UCX) turned out to be the culprit.
What happened, and why did it matter so much?
The pattern was odd: the leak showed up only on the decode side of the Prefill/Decode split, and only when the KVCache transfer path went through NIXL. That pointed to the memory-transfer route as the origin.
Prefill/Decode works like this, at a high level (a toy sketch follows the list):
The router sends a prefill request to a prefill worker, which computes the KVCache.
That KVCache is then transferred to a decode worker, which generates tokens by extending it.
The KVCache transfer happens via NIXL, which relies on UCX for fast communication (RDMA, InfiniBand, etc.).
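To make that flow concrete, here is a purely illustrative Python sketch. None of these classes or functions are real vLLM or NIXL APIs; they only mark where the KVCache crosses a process boundary, which is the path that mattered in this investigation.

```python
# Toy sketch of the Prefill/Decode split. All names are hypothetical,
# not vLLM/NIXL APIs.

class PrefillWorker:
    def prefill(self, prompt: str) -> list[float]:
        # In reality: run the model over the prompt and produce its KVCache.
        return [0.0] * len(prompt)

class DecodeWorker:
    def decode(self, kv_cache: list[float]) -> str:
        # In reality: extend the received KVCache token by token.
        return f"decoded from a KVCache of {len(kv_cache)} entries"

def transfer_via_nixl(kv_cache: list[float]) -> list[float]:
    # In reality: NIXL moves this buffer between processes over UCX
    # (RDMA/InfiniBand). This is the path where the leak lived.
    return kv_cache

kv = PrefillWorker().prefill("hello")   # 1. router asks a prefill worker for the KVCache
kv = transfer_via_nixl(kv)              # 2. the KVCache is shipped to the decode side
print(DecodeWorker().decode(kv))        # 3. the decode worker generates tokens from it
```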
If this sounds technical, the practical lesson is simple: when you move large blocks of memory between processes and use high-performance libraries, there are more places where things can go wrong.
How they investigated it (without getting lost in jargon)
They started with Python profiling tools, Memray and Guppy 3, and then moved to Heaptrack, a native heap profiler. Curiously, Heaptrack showed the heap was stable, but the RSS (resident set size) kept growing. How is that possible? Because RSS includes more than the heap: it also covers anonymous mmap regions, huge pages, and memory managed outside glibc's allocator.
To watch the mappings in real time they ran pmap repeatedly (with watch) and noticed certain anonymous regions growing and changing addresses: clear signs of mmap/mremap activity, or of munmap + mmap cycles that never actually returned memory to the OS.
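For reference, the same observation can be reproduced with a few lines of Python that read /proc directly (pmap is essentially a formatter for /proc/&lt;pid&gt;/maps). This is a minimal sketch of the idea, not the exact tooling the team used:

```python
# Poll a process's RSS and its anonymous mappings from /proc, to spot growth
# that heap profilers (which only see glibc malloc/free) will miss.
import sys
import time

def rss_kb(pid: int) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value reported in kB
    return 0

def anon_regions(pid: int) -> list[tuple[str, int]]:
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 5:  # no pathname field => anonymous mapping
                start, end = (int(x, 16) for x in parts[0].split("-"))
                regions.append((parts[0], end - start))
    return regions

pid = int(sys.argv[1])  # PID of the worker to watch
while True:
    regions = anon_regions(pid)
    total_kb = sum(size for _, size in regions) // 1024
    print(f"RSS={rss_kb(pid)} kB, anonymous regions={len(regions)}, "
          f"anonymous total={total_kb} kB")
    time.sleep(5)
```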
Heaptrack only catches glibc malloc/free, so they needed to go one level deeper.
BPFtrace and why it was useful
With a BPFtrace script they traced mmap, munmap and mremap at the syscall level. That trace showed the calls came from syscall+29, meaning from a wrapper issuing raw syscalls directly into the kernel. That indicated the library making the calls was bypassing the usual glibc routes, and with them the hooks the profilers rely on.
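The original script was written in bpftrace; an equivalent sketch using bcc's Python bindings (assuming bcc is installed and the script runs as root) would look roughly like this:

```python
# Log mmap, munmap and mremap syscalls from a given PID at the kernel level,
# similar in spirit to the bpftrace script used in the investigation.
# Requires the bcc Python bindings and root privileges.
import sys
from bcc import BPF

target_pid = int(sys.argv[1])  # PID of the decode worker

prog = """
int trace_mmap(void *ctx)   { bpf_trace_printk("mmap\\n");   return 0; }
int trace_munmap(void *ctx) { bpf_trace_printk("munmap\\n"); return 0; }
int trace_mremap(void *ctx) { bpf_trace_printk("mremap\\n"); return 0; }
"""

b = BPF(text=prog)
for syscall, fn in (("mmap", "trace_mmap"),
                    ("munmap", "trace_munmap"),
                    ("mremap", "trace_mremap")):
    b.attach_kprobe(event=b.get_syscall_fnname(syscall), fn_name=fn)

while True:
    task, pid, cpu, flags, ts, msg = b.trace_fields()
    if pid == target_pid:  # only report the worker we care about
        print(f"{ts:.6f} {msg.decode()}")
```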
Still, BPFtrace didn't provide the full user stack needed to identify the real caller, only the trace up to the syscall instruction.
GDB with conditional breakpoints: the final trick
As a pragmatic solution they set a very targeted breakpoint on the syscall instruction that only triggered when the syscall number was SYS_mmap. By capturing mmap's return value and the full stack at each hit, they could correlate the returned addresses with the growing regions seen in pmap.
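Here is a sketch of that idea using GDB's Python API (to be sourced from inside gdb after attaching to the decode worker). It breaks on libc's syscall() wrapper and assumes the x86_64 calling convention; it illustrates the technique rather than reproducing the exact commands the team ran:

```python
# GDB Python sketch: react only when the raw syscall being issued is mmap,
# then dump the full user stack. Assumes x86_64 and that the calls go through
# libc's syscall() wrapper (as the syscall+29 return address suggested).
import gdb

SYS_MMAP = 9  # x86_64 syscall number for mmap

class MmapBreakpoint(gdb.Breakpoint):
    def stop(self):
        # At the entry of syscall(long number, ...) the syscall number is the
        # first C argument, i.e. in $rdi on x86_64.
        if int(gdb.parse_and_eval("$rdi")) != SYS_MMAP:
            return False      # not mmap: don't stop the process
        gdb.execute("bt")     # log the full user stack for this mmap
        return False          # log and continue instead of halting the worker

MmapBreakpoint("syscall")
```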
That full trace pointed to the culprit: UCX (via UCM/ucs) was calling mmap from its internal hooks. At times munmap inside UCX triggered operations that in turn caused more mmap. In short: UCX’s internal memory management (its registration cache, or RCache) accumulated regions and didn’t release them immediately.
Why was UCX doing this?
UCX optimizes RDMA transfers by registering (pinning) memory pages so the NIC can access them without CPU intervention. To do that, UCX patches GOT entries at runtime and adds hooks to mmap/munmap to control and speed up memory registration. It’s powerful, but it breaks assumptions of debugging tools and can interfere with simple hooks.
Also, UCX doesn’t free regions immediately: it places them on an invalidation queue managed by its memory pool. If that queue grows without bounds (default value inf), the process keeps requesting more mmap and never recovers RSS.
How they fixed it (immediate and deeper fixes)
The good news: there were clear, safe fixes for this particular use case.
Quick fix: disable UCX’s mmap hook with the environment variable UCX_MEM_MMAP_HOOK_MODE=none. This removed the leak without hurting performance in the vLLM/NIXL flow, because that flow only needs to register one large contiguous region (the KVCache) once.
Another useful measure: limit the queue of unreleased regions with UCX_RCACHE_MAX_UNRELEASED=1024 (the default was infinite). This forces UCX to perform periodic cleanups and prevents unbounded accumulation.
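If the decode worker is launched from Python, one way to make sure both settings are in place is to export them into the environment before UCX initializes. A minimal sketch follows; in practice you would more likely set these in the shell or the deployment manifest that starts the worker:

```python
# Apply both UCX workarounds before UCX is loaded. UCX reads these variables
# when it initializes, so they must be set before NIXL (and therefore UCX)
# is imported/loaded in the decode worker's process.
import os

os.environ.setdefault("UCX_MEM_MMAP_HOOK_MODE", "none")     # don't patch mmap/munmap
os.environ.setdefault("UCX_RCACHE_MAX_UNRELEASED", "1024")  # bound the invalidation queue

# ...only now import/start the components that pull in NIXL/UCX.
```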
In the medium term, vLLM and the maintainers of NIXL/UCX agreed to adjust defaults and change behaviors to reduce the chance of this pattern recurring.
Practical lessons for teams running inference at scale
Don’t rely only on high-level profilers: Heaptrack, Memray or Guppy are helpful, but you must look at RSS, pmap and syscalls when the growing memory isn’t on the heap.
High-performance libraries (UCX, RDMA stacks, GPU managers) can patch functions at runtime. That’s great for speed, but makes debugging harder. How do you handle that? Keep kernel-level traces (BPFtrace/strace) and have strategies to get full stacks when BPFtrace falls short (targeted GDB use).
If you see mmap-backed growth in RSS, think about registration caches, memory pools, and internal queues inside your libraries: the issue might live outside your main Python or C code.
Collaborate with library maintainers. In this case, investigation and coordination with vLLM, NIXL and UCX were key to producing patches and better defaults.
Final reflection
This investigation is a reminder that modern inference infrastructure is a layered stack: every dependency brings optimizations that help—and sometimes introduce new failure modes. Taking the time to go down to the kernel, correlate pmap + BPFtrace + GDB, and understand a library’s intent is what turns a temporary workaround into a robust fix.
If you manage deployments of large, disaggregated models, keep these tools and patterns handy: they can save you days of frustration next time.