Vision-Language Models (VLMs) combine visual perception with natural language reasoning. Why does this matter at the edge? Because now you can have an AI that looks, interprets and reasons in real time right next to your robots or embedded devices, without depending on the cloud.
What you'll find in this tutorial
I show you how to deploy the NVIDIA Cosmos Reasoning 2B (FP8) model on the Jetson family using the vLLM runtime. We'll cover hardware and software requirements, the commands to download the model, how to launch the container, and how to connect the Live VLM WebUI for real-time webcam analysis.
Devices and minimum requirements
Supported devices:
Jetson AGX Thor
Jetson AGX Orin (64GB / 32GB)
Jetson Orin Nano Super
JetPack:
JetPack 6 (L4T r36.x) for Orin
JetPack 7 (L4T r38.x) for Thor
NVMe SSD storage recommended:
~5 GB for FP8 weights
~8 GB for the vLLM container image
Account: create a free NVIDIA NGC account to download the model and container.
Preparation: download the NGC CLI and the FP8 model
Create a working directory and download the NGC CLI for ARM64. Quick example:
mkdir -p ~/Projects/CosmosReasoning
cd ~/Projects/CosmosReasoning
# Download the ARM64 installer (adjust the URL if the version changes)
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set
During ngc config set you'll be asked for your API Key (generate one from the NGC portal). Then download the FP8-quantized checkpoint:
cd ~/Projects/CosmosReasoning
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"
This creates a folder like cosmos-reason2-2b_v1208-fp8-static-kv8/. Save the full path — you'll mount it into the container.
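To avoid typos later, I record the absolute path right away. The folder name below matches the version string above; adjust it if NGC names the directory differently on your system:

```shell
# Record the absolute checkpoint path for the docker -v mount later.
# Folder name comes from the NGC download above; adjust if yours differs.
MODEL_DIR="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
echo "MODEL_DIR=$MODEL_DIR"
ls "$MODEL_DIR" 2>/dev/null || echo "Folder not found - re-check the NGC download"
```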
vLLM containers by device
Jetson AGX Thor:
Image: nvcr.io/nvidia/vllm:26.01-py3
Jetson AGX Orin and Orin Nano Super:
Image: ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 (the r36.4-tegra tag matches the JetPack 6 / L4T r36.x stack that Orin runs)
General flow:
Download the FP8 checkpoint from NGC
Pull the appropriate vLLM container
Launch the container mounting the model folder
Start vllm serve and validate the API
Launching vLLM on Jetson AGX Thor (example)
Mount the model and launch the container (Thor has plenty of memory, so we use a long context):
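A sketch of what this launch typically looks like. The image tag, mount paths, and flag values here are illustrative assumptions, not the exact invocation: substitute the vLLM image listed for your device and tune --gpu-memory-utilization and --max-model-len to your memory budget.

```shell
# Illustrative sketch - image tag, mount path and flag values are assumptions.
VLLM_IMAGE=nvcr.io/nvidia/vllm:26.01-py3   # substitute the image for your device
MODEL_DIR="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"

docker run -it --rm \
  --runtime nvidia \
  --network host \
  -v "$MODEL_DIR":/models/cosmos-reason2-2b \
  "$VLLM_IMAGE" \
  vllm serve /models/cosmos-reason2-2b \
    --host 0.0.0.0 --port 8000 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768
```

Once it's up, the OpenAI-compatible API answers on port 8000.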
Connecting the Live VLM WebUI
With vLLM serving the model, launch the Live VLM WebUI on the same device:
git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh
Open https://localhost:8090 in your browser and accept the self-signed certificate warning. In the VLM API Config section, set the base URL to http://localhost:8000/v1 and refresh so the WebUI detects the model.
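You can also verify the API directly from a terminal before involving the WebUI; since vLLM speaks the OpenAI-compatible protocol, plain curl works. The model name below is an assumption — use whatever /v1/models reports on your setup:

```shell
# List the models vLLM is serving (the name returned goes in the request below).
curl -s http://localhost:8000/v1/models

# Minimal chat-completion request; the model name is an assumption -
# replace it with whatever /v1/models returned.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/cosmos-reason2-2b",
       "messages": [{"role": "user", "content": "Say hello."}],
       "max_tokens": 32}'
```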
Useful WebUI settings for Orin:
Max Tokens: 100-150 for quick answers
Frame Processing Interval: 60+ to give time between frames
Common issues and fixes
vLLM fails with OOM:
Run sudo sysctl -w vm.drop_caches=3 before starting to free page-cache memory (Jetson's GPU shares system RAM, so cached pages count against the model).
Lower --gpu-memory-utilization to 0.55 or 0.50.
Reduce --max-model-len; a shorter context window shrinks the KV-cache memory footprint.
Make sure there are no other GPU-heavy processes running.
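Combined, a memory-conservative relaunch might look like this. The 0.50 figure comes from the tips above; the context length and --max-num-seqs (which caps concurrent requests) are my own assumptions to tune from:

```shell
# Conservative relaunch inside the container - starting points, not tuned values.
# --max-num-seqs 1 (my addition) limits vLLM to one request at a time.
vllm serve /models/cosmos-reason2-2b \
  --gpu-memory-utilization 0.50 \
  --max-model-len 4096 \
  --max-num-seqs 1
```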
Model doesn't show up in WebUI:
Check curl http://localhost:8000/v1/models.
Make sure you use http:// and not https:// for the base URL.
If WebUI and vLLM are in different containers, use http://<jetson-ip>:8000/v1.
Very slow responses:
This is expected on memory-constrained setups; prioritize stability over speed.
Reduce max_tokens and increase the frame interval.
Model path not found:
Verify the NGC download completed and the folder exists.
Check the -v in the docker run command and ensure the in-container path matches what you pass to vllm serve.
What is this useful for in practice?
Think of robots that describe what they see and justify decisions, industrial cameras that detect anomalies and explain why, or visual-assistant prototypes that reason about complex scenes. With Cosmos Reasoning 2B in FP8 and vLLM on Jetson you can bring reasoning prototypes to the edge, with local privacy and low latency.
The key is balancing context and memory: Thor and Orin allow long contexts; the Orin Nano Super needs tweaks to fit, but is still useful for tests and demos.