llama.cpp integrates a router for model management
llama.cpp now includes a router mode that lets you load, unload and switch between models without restarting the server. Does that sound like Ollama? Exactly: it brings similar model management, but inside the lightweight, OpenAI-compatible llama.cpp ecosystem.
What the new feature brings
The llama-server can be started in router mode simply by not specifying a model:
llama-server
In that mode the server performs auto-discovery of models in your llama.cpp cache (the LLAMA_CACHE variable or ~/.cache/llama.cpp) or in a folder you point to with --models-dir. Models previously downloaded with llama-server -hf user/model will appear automatically.
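For example, a model pulled from Hugging Face lands in that cache and is picked up the next time the router starts (the repo name and folder below are only illustrative):
# Download (and serve) a model once; it ends up in the llama.cpp cache
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# Later, start the router with no model: the cached model is discovered automatically
llama-server
# Or point the router at your own folder of .gguf files instead
llama-server --models-dir /path/to/models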
The architecture is multiprocess: each model runs in its own process. Why does that matter? Because if one model crashes, the others keep running. It also supports on-demand loading: the first request that targets a model loads it into memory; subsequent calls skip that step because the model is already resident.
How to use it (commands and examples)
Auto-discovery: scans the default cache or the folder indicated with --models-dir.
On-demand loading: models load on the first request.
LRU eviction: when you hit --models-max (default 4), the least-used model is unloaded.
Routing by request: the model field in the request decides which model handles the query, as in the example below.
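A minimal routing sketch, assuming the router listens on the default port 8080 and that the model name below matches one the router has discovered (the name is illustrative):
# The first request to this model triggers an on-demand load; later requests reuse the running process
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-4b-it",
    "messages": [{"role": "user", "content": "Hello"}]
  }'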
Global flags passed on the command line apply to every model the router loads: start it with --ctx-size 8192 and full GPU offload via -ngl, for example, and all loaded models will use that context size and offload, unless you define per-model presets.
You can define per-model settings with a presets file:
llama-server --models-preset config.ini
Example config.ini:
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
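With that preset in place, a quick way to check what the router exposes and to target the entry is shown below. It assumes the section name ([my-model]) is what you pass in the model field; the path in config.ini and the port are illustrative:
# Start the router with the presets file
llama-server --models-preset config.ini
# List the models the router knows about (OpenAI-compatible endpoint)
curl http://localhost:8080/v1/models
# Send a request to the preset entry by its section name
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]}'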
Technical points to consider:
Process isolation reduces the risk of one model taking the whole service down.
The initial load depends on model size and hardware; plan for latency if loading on demand.
LRU prevents memory from growing uncontrolled, but can introduce extra loads if you frequently switch between many models.
Use cases and practical recommendations
What is this useful for in practice?
A/B testing between versions of a model without restarts (see the sketch after this list).
Multi-tenant deployments where each client uses a different model.
Agile development: switch models from the UI without interrupting sessions.
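For instance, an A/B comparison is just two requests that differ only in the model field (the names and port are illustrative; each model is loaded on demand and stays resident up to --models-max):
# Same prompt, two model versions; the router loads each on first use
for m in "my-model-v1" "my-model-v2"; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\", \"messages\": [{\"role\": \"user\", \"content\": \"Summarize llama.cpp in one line.\"}]}"
  echo
done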
Quick tips:
If you need predictable latency, pre-load critical models with /models/load (see the example after these tips) or increase --models-max if your memory allows.
Use --no-models-autoload to control exactly when resources are consumed.
Use presets to adjust ctx-size, temperature and offload per model according to their needs.
If you expose the server on the network, put a reverse proxy and appropriate authentication in front of it.
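A sketch of the pre-loading and autoload tips; the /models/load path comes from the tips above, but the JSON payload shape and model name are assumptions, so check the server's router documentation:
# Start the router without automatic loading, so memory is only used when you decide
llama-server --no-models-autoload
# Pre-load a critical model explicitly (assumed payload shape)
curl http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model"}'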
Final thoughts
This update brings the flexibility many asked for: manage models without restarting, with isolation between instances. It's an improvement focused on operability and developer experience, without turning llama.cpp into something heavy. Want to try A/B testing with two versions of a 4B model? You can do it in minutes, without restarts.