Reachy Mini runs local conversations with AI

May 27, 2026 · 4 minutes

Want your robot to speak and think without sending anything to the cloud? Now it can: Reachy Mini can hold a conversation entirely locally using the Hugging Face speech-to-speech stack. Audio never leaves your network, and there are no API keys or usage costs. Sounds good, right?

What this article does

I'll guide you step by step through bringing up a local voice backend for Reachy Mini: a VAD -> STT -> LLM -> TTS cascade that exposes a WebSocket endpoint compatible with /v1/realtime. The approach is technical but practical: example commands, the reasons for choosing certain models, and how to reduce latency so conversations feel fluid.

The core idea: speech-to-speech cascade

A cascade splits the pipeline into four clear stages: VAD (voice activity detection), STT (transcription), LLM (reasoning / dialogue) and TTS (output voice). That gives you full control: swap the VAD, try another STT, raise or lower TTS quality. The advantage? Privacy, zero API costs and flexibility to improve parts as new models appear.
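To make the "swap any stage" idea concrete, here is a minimal sketch of a cascade in which each stage is just a callable. The class name `Cascade` and the `fake_*` stages are hypothetical placeholders, not part of the speech-to-speech library; real stages would wrap an actual VAD, STT, LLM and TTS model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cascade:
    """Four swappable stages; each one is a plain callable."""
    vad: Callable[[bytes], bool]   # does this audio chunk contain speech?
    stt: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def respond(self, audio: bytes) -> Optional[bytes]:
        # Skip silence so the later, more expensive stages never run on it.
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Placeholder stages (hypothetical), used only to show the data flow.
def fake_vad(audio: bytes) -> bool:
    return any(audio)  # "speech" = at least one non-zero byte


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Cascade(vad=fake_vad, stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Replacing one stage means passing a different callable for that field; the other three stay untouched. A real engine also streams audio between stages instead of handling one chunk at a time, which this sketch leaves out.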

Hugging Face recommends a combination that works very well today:

VAD: Silero VAD v5 - small, accurate and CPU-friendly.
STT: Parakeet-TDT - streaming-friendly and fast.
LLM: local options with llama.cpp, vLLM, MLX or transformers.
TTS: Qwen3-TTS - expressive, multilingual and low-latency.

Bringing up the LLM server (example with llama.cpp)

If you go with llama.cpp, use the llama-server binary. Installation is simple (via brew or winget); then run:

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

What do those flags do?

-hf ggml-org/gemma-4-E4B-it-GGUF - downloads the model from the Hub the first time and then uses it from cache.
-np 2 - two parallel slots to handle interruptions without blocking the first request.
-c 65536 - 64k context window shared between slots, useful for long conversations.
-fa on - flash attention: faster and less memory on modern hardware.
--swa-full - keeps a full-size sliding-window attention (SWA) cache so earlier parts of the prompt can be reused and processed faster, at the cost of more memory.

The first run takes time to download the model; after that it starts quickly.
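Because the first start can take a while, it may help to wait until the server is actually ready before launching the voice engine. The sketch below polls llama-server's `/health` endpoint; the helper name `wait_until_ready`, the timeout values, and the base URL `http://127.0.0.1:8080` (taken from the commands in this article) are assumptions to adjust for your setup.

```python
import time
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, a timeout, or an HTTP error status
            # (for example while the model is still loading).
            pass
        time.sleep(interval_s)
    return False
```

A typical call would be `wait_until_ready("http://127.0.0.1:8080")`, starting speech-to-speech only if it returns `True`.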

Start the local voice engine (speech-to-speech)

Install the library and start the engine in local mode while the LLM is running in another terminal:

pip install speech-to-speech

speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local

The first run will download Parakeet and Qwen3-TTS; subsequent starts are fast. The CLI spins up a WebSocket endpoint at /v1/realtime, which Reachy Mini already knows how to talk to.
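Clients of a /v1/realtime-style WebSocket usually stream microphone audio as small JSON events. The sketch below builds one such event using the shape defined by OpenAI's Realtime API (`input_audio_buffer.append` with base64-encoded audio). That the local server accepts exactly this event, and which audio encoding it expects, are assumptions here; check the speech-to-speech documentation before relying on it. The helper name `audio_append_event` is hypothetical.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Serialize one raw audio chunk as an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # JSON cannot carry raw bytes, so the audio travels as base64 text.
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```

Each string returned by this helper would be sent as one text frame over the WebSocket connection.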

Realtime mode with a separate backend (Responses API)

If you prefer the LLM to be in another process or machine, speech-to-speech speaks the Responses API protocol. Example with llama.cpp server + speech-to-speech client:

Terminal 1: llama.cpp

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

Terminal 2: speech-to-speech client in realtime mode

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "unsloth/Qwen3-4B-Instruct-2507-GGUF" \
  --responses_api_base_url "http://127.0.0.1:8080/v1"
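For a quick sanity check of the backend on its own, you can build a Responses API request by hand. This is a sketch under assumptions: it uses the standard `POST /responses` route with a `model` and `input` field, and the helper name `build_responses_request` is hypothetical. It only constructs the request; sending it is left to the caller.

```python
import json
import urllib.request


def build_responses_request(base_url: str, model: str, text: str,
                            api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST {base_url}/responses request."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Local servers usually need no key; hosted providers usually do.
        headers["Authorization"] = f"Bearer {api_key}"
    url = base_url.rstrip("/") + "/responses"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` while llama-server is running should return a JSON response; if it does not, the problem is in the LLM server and not in the voice pipeline.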

On more capable hardware, vLLM is recommended for lower latency. With vLLM, three flags are close to mandatory:

--enable-auto-tool-choice
--tool-call-parser <tool_parser_name> - choose the parser by model family (for example qwen3_coder or llama3_json).
--default-chat-template-kwargs '{"enable_thinking":false}' - disables the thinking channel, whose inner-thought tokens are never spoken and so reach the user only as silence.

Example vLLM server for Qwen3-4B-Instruct-2507 with MTP (Multi-Token Prediction):

vllm serve Qwen/Qwen3-4B-Instruct-2507 \
  --port 8000 \
  --host 127.0.0.1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

MTP helps a lot with end-to-end latency; enable it if the model supports it.
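A rough way to reason about the gain: with speculative decoding, each verification step emits one guaranteed token plus every drafted token that is accepted in sequence. The sketch below computes the expected tokens per step under the simplifying assumption that each drafted token is accepted independently with the same probability. The acceptance rates are hypothetical inputs, not measurements of this model.

```python
def expected_tokens_per_step(acceptance: float, num_speculative_tokens: int = 1) -> float:
    """Expected tokens emitted per step: 1 + a + a^2 + ... + a^k.

    A drafted token only counts if every drafted token before it was
    accepted, hence the increasing powers of the acceptance rate `a`.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    return sum(acceptance ** i for i in range(num_speculative_tokens + 1))
```

With `num_speculative_tokens` set to 1, as in the command above, the result is simply 1 plus the acceptance rate, so the best case is two tokens per step. Real speedups are lower because drafting and verification add their own cost.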

speech-to-speech client pointing to vLLM:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "Qwen/Qwen3-4B-Instruct-2507" \
  --responses_api_base_url "http://127.0.0.1:8000/v1"

If you prefer a managed endpoint or an external provider, the same --responses_api_base_url flag applies: change the URL and pass the key with --responses_api_api_key when needed.
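One way to switch between local and hosted backends without editing commands is to read the endpoint from the environment. The sketch below assembles the same command line shown earlier; the environment variable names `S2S_MODEL`, `S2S_BASE_URL` and `S2S_API_KEY` are hypothetical, while the CLI flags are the ones used in this article.

```python
import os
from typing import List, Mapping, Optional


def speech_to_speech_argv(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Assemble the speech-to-speech command line from environment settings."""
    env = os.environ if env is None else env
    argv = [
        "speech-to-speech",
        "--mode", "realtime",
        "--stt", "parakeet-tdt",
        "--tts", "qwen3",
        "--llm_backend", "responses-api",
        "--model_name", env.get("S2S_MODEL", "Qwen/Qwen3-4B-Instruct-2507"),
        "--responses_api_base_url", env.get("S2S_BASE_URL", "http://127.0.0.1:8000/v1"),
    ]
    api_key = env.get("S2S_API_KEY", "")
    if api_key:
        # Only pass a key when one is configured (hosted endpoints).
        argv += ["--responses_api_api_key", api_key]
    return argv
```

The returned list can be handed to `subprocess.run(argv, check=True)`. Keeping the key in the environment also keeps it out of your shell history.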

Other LLM backend options

MLX on Mac: if you're on an Apple Silicon (M-series) Mac, the mlx-lm backend usually gives the best experience with low latency. Example:

speech-to-speech --llm_backend mlx-lm --model_name "mlx-community/Qwen3-4B-Instruct-2507-bf16"

Transformers on CUDA/Linux: if you have a GPU on Linux and want to switch models without converting weights, use --llm_backend transformers.
Use a provider: point --responses_api_base_url to the provider’s endpoint or to Hugging Face’s router to try larger models without your own infra.

Connect Reachy Mini from another machine

If the voice engine runs on your laptop and Reachy Mini is on the network, make sure the server listens on a LAN address and not only on 127.0.0.1. In the Reachy conversation app you select your laptop’s IP.

How to find your IP:

macOS: ipconfig getifaddr en0 (typically Wi-Fi) or ipconfig getifaddr en1 (typically Ethernet; interface names vary by machine)
Linux: hostname -I
Windows: ipconfig and look for IPv4 Address

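Use a private LAN address such as 192.168.x.x, 10.x.x.x or 172.16.x.x through 172.31.x.x. If you see 169.254.x.x, that is a self-assigned link-local address: the machine did not get an address from the router, so it is not properly on the network.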
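The same check can be done in code with Python's standard `ipaddress` module. This is a small sketch; the function name `classify_lan_address` and its return labels are hypothetical.

```python
import ipaddress


def classify_lan_address(text: str) -> str:
    """Label an IP address by how useful it is for reaching this machine."""
    ip = ipaddress.ip_address(text)
    if ip.is_loopback:
        return "loopback"    # 127.x.x.x: reachable only from this machine
    if ip.is_link_local:
        return "link-local"  # 169.254.x.x: no address was assigned by the router
    if ip.is_private:
        return "private"     # 10.x, 172.16-31.x, 192.168.x: normal LAN address
    return "public"
```

Only an address labeled `private` is the one to enter in the Reachy conversation app in a typical home or office network. The link-local check comes before the private check because `ipaddress` also reports 169.254.x.x addresses as private.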

Trade-offs and practical optimizations

Each stage has trade-offs. You want the lowest latency possible so the robot responds naturally, but also good voice and transcription quality. Quick recommendations:

Optimize the LLM for latency: small well-tuned models (Qwen3-4B, small Gemma) or MTP in vLLM.
If you operate in a single language, prioritize STT/TTS models optimized for that language to gain quality.
Keep enable_thinking=false if you want fluid conversation without long pauses from internal reasoning.
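Before tuning anything, it helps to know which stage actually dominates the delay. The sketch below is a minimal per-stage timer; the class name `StageTimer` and the stage labels are hypothetical, and you would wrap your own STT, LLM and TTS calls in place of the comments.

```python
import time
from contextlib import contextmanager
from typing import Dict


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.durations, key=self.durations.get)
```

Usage looks like `with timer.stage("llm"): ...` around each call; after a few turns, `timer.slowest()` points at the stage worth optimizing first.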

In my experience testing this stack on an M1 Mac and a GPU laptop, MLX felt instant on the Mac, while vLLM with MTP cut server GPU latency significantly. The nice part is you can try and swap pieces without redoing the whole architecture.

To finish

If you care about privacy and control over your robot’s conversational experience, running everything locally is a pragmatic leap. It’s not magic: these are concrete components you can install, test and replace when better models arrive. Which combination will you try first?

Original source

https://huggingface.co/blog/local-reachy-mini-conversation
