Gemma 4 runs VLA on Jetson Orin Nano Super

Apr 22, 2026Keryc Díaz4 minutes

This tutorial shows how to run a VLA (Vision-Language Assistant) demo with Gemma 4 on an NVIDIA Jetson Orin Nano Super 8 GB. What's the standout bit? The whole flow is local, and the decision of when to look comes from the model — not from hardcoded rules. I'll walk you through the technical steps so you can replicate it without getting lost in unnecessary jargon.

Qué hace esta demo VLA

The pipeline is simple: you speak -> Parakeet STT -> Gemma 4 -> (if needed) take a photo with the webcam -> Kokoro TTS -> speaker. You press SPACE to start recording, and SPACE again to stop. But what does VLA mean here? It means the model decides, by context, whether it needs to see something. No keywords, no rigid logic.

If you ask something that requires seeing the environment, Gemma 4 calls the look_and_answer tool, the demo captures an image, sends it to the model, and Gemma replies using what it saw. It doesn’t describe the photo for the sake of describing it; it answers your question with that visual context.

Key point: the demo runs on an 8 GB Jetson Orin Nano. Impressive, right? But you do need to prepare the board.

Hardware y software que se usó

NVIDIA Jetson Orin Nano Super (8 GB)
Logitech C920 webcam (built-in mic)
USB speaker
USB keyboard (to press SPACE)

You’re not tied to those exact models: any webcam, USB mic and speaker that Linux recognizes should work.

Preparación del sistema (resumen técnico)

First, update and install basic packages for audio, webcam and Python development. Key commands:

sudo apt update
sudo apt install -y git build-essential cmake curl wget pkg-config \
  python3-pip python3-venv python3-dev \
  alsa-utils pulseaudio-utils v4l-utils psmisc \
  ffmpeg libsndfile1

Create a virtual environment and install Python dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface-hub numpy

Practical tip: on an 8 GB board it's crucial to free memory before launching large models. Close browsers, IDEs, and unneeded services.

Swap y limpieza de procesos (para evitar OOM)

Adding an 8 GB swap helps avoid the process being killed for lack of memory while loading the model:

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Before starting the server, stop heavy processes like Docker if you don't need them, and kill services like tracker-miner-fs-3 or gnome-software if they're eating RAM.

Modelos, quantizaciones y por qué elegir Q4_K_M

From testing, Q4_K_M offers the best balance between capability and memory use on 8 GB. If you still run out of memory, you can drop to the Q3 version (file name includes Q3) to save RAM at the cost of some accuracy. But the recommendation is: try Q4_K_M first.

You also need the mmproj file which is the vision projector. Without it, Gemma can’t "see".

Build nativo de llama.cpp y lanzamiento de llama-server

You build a native llama.cpp for better performance and support for the vision projector. Essential steps:

cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4

Download the models into ~/models:

mkdir -p ~/models && cd ~/models
wget -O gemma-4-E2B-it-Q4_K_M.gguf \
  https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf
wget -O mmproj-gemma4-e2b-f16.gguf \
  https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf

Launch llama-server with the mmproj and recommended flags:

~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  -c 2048 \
  --image-min-tokens 70 --image-max-tokens 70 \
  --ubatch-size 512 --batch-size 512 \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1

Technical note: -ngl 99 tries to push all layers to the GPU. If you hit memory problems, lower that number for partial offload to CPU.

Quick endpoint test:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hi!"}],"max_tokens":32}' \
  | python3 -m json.tool

If you get JSON, the server is responding.

Demo Python: Gemma4_vla.py y flujo completo

The Gemma4_vla.py script (one single file) orchestrates everything: local STT (Parakeet), the call to llama-server and TTS (Kokoro). You can clone the full repo or download just the script:

Clone repo:

git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4

Or download only the script:

wget https://raw.githubusercontent.com/asierarranz/Google_Gemma/main/Gemma4/Gemma4_vla.py

On first run the script downloads Parakeet STT, Kokoro TTS and generates voice WAVs. Then: press SPACE to record, speak, press SPACE to stop and Gemma answers.

Environment variables and an example run:

export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
export WEBCAM=0
export VOICE="af_jessica"
source .venv/bin/activate
python3 Gemma4_vla.py

There’s also a text-only mode if you want to test just the LLM part without audio:

python3 Gemma4_vla.py --text

The demo defines exactly one tool exposed to Gemma 4:

{
  "name": "look_and_answer",
  "description": "Take a photo with the webcam and analyze what is visible."
}

When the model decides it needs vision, it invokes look_and_answer, the demo takes the photo, sends it to the server and the response is synthesized with Kokoro.

The magic here is the --jinja flag in llama-server, which enables native tool-call support in Gemma.

Docker: opción más rápida pero limitada

If you don’t want to compile everything, NVIDIA published a prebuilt Docker image for Orin. It’s one line:

sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S

Attention: this Docker route is text-only. It doesn’t load the mmproj, so it won’t work for the full webcam demo. It’s ideal if you only want to experiment with the model via text without compiling.

Problemas comunes y soluciones rápidas

No sound: run pactl list short sinks and make sure SPK_DEVICE matches a real sink.
Mic records silence: run arecord -l to verify the device, then test manually with arecord -D "$MIC_DEVICE" -f S16_LE -r 16000 -c 1 -d 3 /tmp/test.wav.
Slow first run: expected. It’s downloading models and generating voice assets. Subsequent runs are faster.
Out of memory: repeat process cleanup, enable swap, lower -ngl or use the Q3 quant if needed.

Un par de detalles finales para que funcione bien

The mmproj file is essential for vision. Don’t skip it.
If memory is still tight, prefer Q4_K_M or drop to Q3 as a last resort.
Keep the system as clean as possible before starting: every megabyte counts.

The experience is straightforward: you speak, the model decides if it needs to look, and it replies using what it saw. It’s a nice example of how native tool calls in LLMs enable emergent, practical behaviors on embedded devices.

Fuente original

https://huggingface.co/blog/nvidia/gemma4

Stay up to date!

Get AI news, tool launches, and innovative products straight to your inbox. Everything clear and useful.