Gemma 4 runs VLA on Jetson Orin Nano Super | Keryc
This tutorial shows how to run a VLA (Vision-Language-Action) demo with Gemma 4 on an NVIDIA Jetson Orin Nano Super (8 GB). The standout bit: the whole flow runs locally, and the decision of when to look comes from the model, not from hardcoded rules. I'll walk you through the technical steps so you can replicate it without getting lost in unnecessary jargon.
What this VLA demo does
The pipeline is simple: you speak -> Parakeet STT -> Gemma 4 -> (if needed) take a photo with the webcam -> Kokoro TTS -> speaker. You press SPACE to start recording, and SPACE again to stop. But what does VLA mean here? It means the model decides, by context, whether it needs to see something. No keywords, no rigid logic.
If you ask something that requires seeing the environment, Gemma 4 calls the look_and_answer tool, the demo captures an image, sends it to the model, and Gemma replies using what it saw. It doesn’t describe the photo for the sake of describing it; it answers your question with that visual context.
Key point: the demo runs on an 8 GB Jetson Orin Nano. Impressive, right? But you do need to prepare the board.
Hardware and software used
NVIDIA Jetson Orin Nano Super (8 GB)
Logitech C920 webcam (built-in mic)
USB speaker
USB keyboard (to press SPACE)
You’re not tied to those exact models: any webcam, USB mic and speaker that Linux recognizes should work.
System preparation (technical summary)
First, update and install basic packages for audio, webcam and Python development. Key commands:
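As a sketch, the install step looks something like this. The package names are typical for the Ubuntu base that JetPack ships with, not taken from the original; adjust for your release:

```shell
# Refresh package lists and pull in audio, webcam, and Python dev basics.
sudo apt update
sudo apt install -y \
  alsa-utils pulseaudio-utils \
  v4l-utils \
  portaudio19-dev \
  python3-pip python3-venv \
  build-essential cmake git
```

alsa-utils and pulseaudio-utils give you arecord/aplay and pactl for the audio debugging steps later; v4l-utils lets you inspect the webcam; the toolchain packages are for the native llama.cpp build.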
Before starting the server, stop heavy processes like Docker if you don't need them, and kill services like tracker-miner-fs-3 or gnome-software if they're eating RAM.
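To decide what is worth stopping, a quick way to list the biggest RAM consumers (the service names in the comments are the ones mentioned above; stopping them is up to you):

```shell
# Show the six biggest memory consumers; safe to run on any Linux box.
ps aux --sort=-%mem | head -n 6

# Then stop what you don't need before launching the server, e.g.:
# sudo systemctl stop docker
# sudo pkill -f tracker-miner-fs-3
# sudo pkill -f gnome-software
```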
Models, quantizations, and why to choose Q4_K_M
From testing, Q4_K_M offers the best balance between capability and memory use on 8 GB. If you still run out of memory, you can drop to the Q3 version (its file name includes Q3) to save RAM at the cost of some accuracy. But the recommendation stands: try Q4_K_M first.
You also need the mmproj file, which is the vision projector. Without it, Gemma can’t "see".
Native llama.cpp build and launching llama-server
You build a native llama.cpp for better performance and support for the vision projector. Essential steps:
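A sketch of those steps, assuming a CUDA-enabled CMake build; the model and mmproj file names below are placeholders, so swap in the actual GGUF files you downloaded:

```shell
# Clone and build llama.cpp with CUDA support for the Orin's GPU.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"

# Launch llama-server with the model, the vision projector (mmproj),
# and --jinja so native tool calls work. File names are placeholders.
./build/bin/llama-server \
  --model  ./models/gemma-Q4_K_M.gguf \
  --mmproj ./models/mmproj.gguf \
  --jinja \
  -ngl 99 \
  --host 127.0.0.1 --port 8080
```

-ngl 99 offloads all layers to the GPU; lower it if you hit out-of-memory errors, as noted in the troubleshooting section.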
The Gemma4_vla.py script (a single file) orchestrates everything: local STT (Parakeet), the call to llama-server, and TTS (Kokoro). You can clone the full repo or download just the script:
Clone repo:
git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4
On first run the script downloads Parakeet STT, Kokoro TTS and generates voice WAVs. Then: press SPACE to record, speak, press SPACE to stop and Gemma answers.
There’s also a text-only mode if you want to test just the LLM part without audio:
python3 Gemma4_vla.py --text
The demo defines exactly one tool exposed to Gemma 4:
{
"name": "look_and_answer",
"description": "Take a photo with the webcam and analyze what is visible."
}
When the model decides it needs vision, it invokes look_and_answer, the demo takes the photo, sends it to the server and the response is synthesized with Kokoro.
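To see the tool-call mechanics without the audio front end, you can hit llama-server's OpenAI-compatible endpoint directly. This is a sketch, not the demo's own code: the port depends on how you launched the server, and the question is just an example, but the tool schema mirrors the one above:

```shell
# Ask a question that needs vision; with --jinja enabled, the server should
# reply with a tool_calls entry naming look_and_answer instead of plain text.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What color is the mug on my desk?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "look_and_answer",
        "description": "Take a photo with the webcam and analyze what is visible."
      }
    }]
  }'
```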
The magic here is the --jinja flag in llama-server, which enables native tool-call support in Gemma.
Docker: a faster but more limited option
If you don’t want to compile everything, NVIDIA published a prebuilt Docker image for Orin. It’s one line:
Attention: this Docker route is text-only. It doesn’t load the mmproj, so it won’t work for the full webcam demo. It’s ideal if you only want to experiment with the model via text without compiling.
Common problems and quick fixes
No sound: run pactl list short sinks and make sure SPK_DEVICE matches a real sink.
Mic records silence: run arecord -l to verify the device, then test manually with arecord -D "$MIC_DEVICE" -f S16_LE -r 16000 -c 1 -d 3 /tmp/test.wav.
Slow first run: expected. It’s downloading models and generating voice assets. Subsequent runs are faster.
Out of memory: repeat process cleanup, enable swap, lower -ngl or use the Q3 quant if needed.
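Enabling swap is standard Ubuntu fare; a 4 GB file is my suggested starting point on the 8 GB board, not a figure from the original:

```shell
# Create and enable a 4 GB swap file (lasts until reboot unless you
# also add an entry to /etc/fstab).
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show
```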
A couple of final details to make it work well
The mmproj file is essential for vision. Don’t skip it.
If memory is still tight, prefer Q4_K_M or drop to Q3 as a last resort.
Keep the system as clean as possible before starting: every megabyte counts.
The experience is straightforward: you speak, the model decides if it needs to look, and it replies using what it saw. It’s a nice example of how native tool calls in LLMs enable emergent, practical behaviors on embedded devices.