Robotic AI on Embedded: VLA, Datasets and Optimizations | Keryc
The evolution of language models toward multimodal systems now lets vision, language and action coexist in a single model. What’s the catch? Bringing those VLA (Vision–Language–Action) models to embedded hardware, with real constraints on CPU, memory, NPU and real‑time behavior, is hard. This technical article summarizes NXP’s practices for recording reliable datasets, fine‑tuning ACT policies and SmolVLA, and the on‑device optimizations that made those policies run on an i.MX95.
Why is it hard to run VLA on embedded platforms?
Because it’s not just compressing a model. It’s systems engineering: breaking the architecture apart, coding with latency awareness, and aligning execution to the available hardware. In real time, a slow inference causes the robot to pause, which creates oscillating corrections and worse recovery. Practical rule? Keep inference latency below the execution horizon: T_inference < T_execution.
In synchronous pipelines, the arm sits idle while the VLA infers. The solution is asynchronous inference: generate the next action chunk in parallel with execution of the current one. But for it to work, you must guarantee that inference time is shorter than the action‑chunk duration, which places a hard upper bound on acceptable inference latency.
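A minimal sketch of that producer/consumer loop in Python. The chunk length, control frequency and simulated latencies are illustrative assumptions, not NXP’s actual figures:

```python
import threading
import time
from collections import deque

CHUNK_LEN = 100                        # actions per chunk (illustrative)
CONTROL_HZ = 50                        # control-loop frequency (illustrative)
T_EXECUTION = CHUNK_LEN / CONTROL_HZ   # seconds to play back one chunk

def infer_chunk():
    """Stand-in for the VLA forward pass; must finish in < T_EXECUTION."""
    time.sleep(0.05)                   # simulated inference latency
    return [0.0] * CHUNK_LEN

def run_async(n_chunks=3):
    """Execute the current chunk while the next one is inferred in parallel."""
    pending = deque([infer_chunk()])   # warm start: one chunk ready
    executed = 0
    while executed < n_chunks:
        chunk = pending.popleft()
        # kick off inference for the NEXT chunk in a background thread
        worker = threading.Thread(target=lambda: pending.append(infer_chunk()))
        worker.start()
        time.sleep(0.02)               # stand-in for playing back `chunk`
        worker.join()                  # inference must beat T_EXECUTION
        executed += 1
    return executed
```

If `infer_chunk` ever takes longer than `T_EXECUTION`, the `join` blocks past the end of playback and the robot stalls, which is exactly the failure mode the T_inference < T_execution rule guards against.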
Dataset: quality over quantity
Want your policy to learn fine manipulations? Don’t feed it messy data. Here are the practical recommendations they used for the task of putting a tea bag into a cup.
Fixed camera mounts. Avoid pose drift from vibrations or re‑adjustments. A change after recording can sink your accuracy.
Controlled lighting. Keep sources fixed and avoid variable sunlight.
High contrast between arm, object and background. Avoid the classic white-on-white unless that’s your domain.
Backups of robot and teleoperator calibrations.
Don’t cheat when recording: only log what the policy will see at runtime, usually camera inputs, not the operator’s direct observations.
On number and type of cameras:
Mixing views increases accuracy but adds latency. In their case, the balance was 3 cameras: Top, Gripper and Left.
Top      Global view
Gripper  Close view for precise grasps
Left     Complements depth and height
The gripper camera improves success rate for fine manipulations and also forces good recording practices because the operator must trust the robot’s perception. Secure the cable with velcro or strain‑relief guides to avoid obstructions.
Small hardware tweaks help: heat‑shrink on the gripper increases friction and reduces slips, which lowers near‑success episodes and stabilizes learning.
Recording strategy
Vary initial positions: partition the workspace into clusters and record 10+ episodes per cluster. In the example they used 11 clusters of 10x10 cm.
Separate validation set: exclude an entire cluster from training to measure generalization (example: cluster 6).
Include wide movements and recovery: about 20% of episodes should be recovery, where the policy has to return to the object.
Final checkpoint: the one with the lowest validation loss after 200k steps, with the caveat that the final pick should also weigh real success rates.
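The cluster‑and‑holdout scheme above can be sketched like this. The 1‑D cell layout, seed and helper names are illustrative assumptions; the article only specifies 11 clusters of 10x10 cm, 10+ episodes per cluster, and one held‑out cluster:

```python
import random

def make_clusters(n_clusters=11, cell_cm=10):
    """Hypothetical 1-D layout of 10x10 cm cells along the workspace."""
    return {i: (i * cell_cm, (i + 1) * cell_cm) for i in range(n_clusters)}

def sample_start_positions(clusters, episodes_per_cluster=10, holdout=6, seed=0):
    """Draw initial positions per cluster; the holdout cluster is
    reserved for validation and never appears in training."""
    rng = random.Random(seed)
    train, val = [], []
    for cid, (lo, hi) in clusters.items():
        episodes = [(cid, rng.uniform(lo, hi))
                    for _ in range(episodes_per_cluster)]
        (val if cid == holdout else train).extend(episodes)
    return train, val
```

Keeping the holdout cluster spatially disjoint from training is what makes its success rate a genuine generalization measure rather than a memorization check.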
Training: actions per chunk and checkpoint selection
They found that for ACT the best trade‑off was 100 actions per chunk with effective training between 100k and 160k steps. For SmolVLA (50 actions per chunk) many more steps are required. One important observation: sometimes continuing a little past apparent overfitting improves practical accuracy.
Golden rule: pick the final checkpoint by evaluating success on training and validation, not only training loss.
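One way to encode that rule as a selection function. The 0.4/0.6 weighting between training and validation success is an illustrative assumption, not a value from the article:

```python
def select_checkpoint(checkpoints):
    """Pick the final checkpoint by measured success, not training loss.
    Each entry: {'step': int, 'train_success': float, 'val_success': float}.
    The 0.4/0.6 weighting is an illustrative choice that favors
    generalization (validation success) over fit (training success)."""
    return max(
        checkpoints,
        key=lambda ck: 0.4 * ck["train_success"] + 0.6 * ck["val_success"],
    )
```

With this weighting, a checkpoint with slightly worse training success but much better validation success wins, which matches the observation that the lowest‑training‑loss checkpoint is often not the best real‑world performer.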
Practical architecture: decomposition into blocks
Avoid a monolithic graph. Split into logical blocks you can optimize and deploy independently:
Vision encoder: processes RGB frames and produces visual embeddings.
LLM backbone: turns visual and textual embeddings into action tokens.
Action expert: applies flow matching and iterative denoising to produce control commands.
This separation lets you quantify the impact of quantization per block and run the action expert at a lower frequency if beneficial.
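A sketch of the three‑block decomposition as independently swappable callables. The class name, signatures and data shapes are illustrative, not the article’s actual interfaces:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VLAPipeline:
    """Three logical blocks you can quantize and deploy independently."""
    vision_encoder: Callable  # RGB frames -> visual embeddings
    llm_backbone: Callable    # (visual emb, text emb) -> action tokens
    action_expert: Callable   # action tokens -> control commands

    def step(self, frames, text_emb):
        vis = self.vision_encoder(frames)
        tokens = self.llm_backbone(vis, text_emb)
        return self.action_expert(tokens)
```

Because each block is a separate callable, you can A/B a quantized vision encoder against the FP32 one, or run the action expert at a different rate, without touching the rest of the graph.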
Quantization and precision: real trade‑offs
Not all parts tolerate quantization equally. In their experience on i.MX95:
Vision encoder and LLM prefill lost little accuracy when quantized to 8‑bit, or even 4‑bit in selected layers.
The iterative denoising part of the action expert suffers badly from quantization. Errors accumulate across iterative steps and degrade stability.
Practical decision: keep the action expert at higher precision and quantize the rest selectively. They also applied block‑specific optimizations to squeeze the hardware.
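One way to express such a per‑block precision plan as configuration. The exact dtype assigned to each block is an illustrative assumption loosely following the trade‑offs above, not NXP’s published settings:

```python
# Illustrative per-block precision plan: quantize aggressively where
# error does not accumulate, keep the iterative action expert precise.
PRECISION_PLAN = {
    "vision_encoder": "int8",   # tolerates quantization well
    "llm_prefill": "int4",      # selected layers only, per the article
    "llm_decode": "int8",
    "action_expert": "fp16",    # denoising steps accumulate quantization error
}

def dtype_for(block):
    """Look up a block's precision; unknown blocks default to fp32."""
    return PRECISION_PLAN.get(block, "fp32")
```

Centralizing the plan in one table makes it cheap to sweep per‑block precision and quantify the accuracy impact block by block, as the article suggests.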
Asynchronous inference and latency‑aware scheduling
Conceptual comparison:
Synchronous: capture, run inference to completion, then execute. The robot is idle during inference.
Asynchronous: while the current chunk is executed, the next chunk is generated in parallel.
Benefits of asynchrony: higher effective control frequency, fewer stale observations, better recovery. But remember the equation: asynchrony only helps if T_inference < T_execution.
On embedded systems it’s key to design a scheduler that:
Prioritizes latency for critical blocks
Places execution on CPU/GPU/NPU according to capability
Uses action queues with aggregation functions (for example, weighted_average) to smooth transitions between chunks
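A minimal sketch of one plausible `weighted_average`‑style aggregation: linearly ramping from the outgoing chunk’s tail to the incoming chunk’s head over their overlap. The ramp shape is an assumption; the actual aggregation function used may weight differently:

```python
def weighted_average(old_tail, new_head):
    """Blend the overlapping actions of the outgoing and incoming chunks.
    The weight ramps linearly toward the new chunk, so the hand-off
    avoids a step discontinuity in the commanded actions."""
    assert len(old_tail) == len(new_head)
    n = len(old_tail)
    blended = []
    for i, (a, b) in enumerate(zip(old_tail, new_head)):
        w = (i + 1) / (n + 1)          # 0 -> old chunk, 1 -> new chunk
        blended.append((1 - w) * a + w * b)
    return blended
```

Smoothing the chunk boundary this way reduces the visible jerk when a freshly inferred chunk replaces a stale one mid‑motion.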
Results on i.MX95 (benchmarks)
i.MX95 integrates 6x Cortex‑A55, Cortex‑M7/M33, a Mali GPU, a new ISP and the eIQ Neutron NPU, designed for efficient inference with multi‑camera setups.
Table reproduced from the study (times and success rates on the tea bag to cup task):
Platform  Policy   Format     Inference latency  Test Acc (20)  Val Acc (10)  Global Acc (30)
i.MX95    ACT      ONNX FP32  2.86 s             1.00           0.90          0.96
i.MX95    ACT      Optimized  0.32 s             1.00           0.60          0.89
i.MX95    SmolVLA  ONNX FP32  29.10 s            0.50           0.40          0.47
They also reported an on‑board optimized SmolVLA version reaching 6.15 s latency in a later optimization phase.
Quick take: ACT can be viable in real time with aggressive optimizations. SmolVLA needs more NPU work and flow optimizations to cut latency without losing accuracy.
Practical roadmap to deploy VLA on embedded
If you want to replicate this in your project:
1. Dataset preparation
   Validate mounts, calibrations and contrast
   Include recovery episodes and position clusters
2. Training
   Save checkpoints and parameters every 20k steps
   Evaluate by success on a held‑out validation set
3. Deployment on i.MX95 or another embedded target
   Break your model into blocks
   Quantize selectively and keep sensitive parts at higher precision
   Implement asynchronous inference and a latency‑aware scheduler
Next technical steps they propose: simulate to scale data and benchmarks, use RL to refine policies, and apply sim‑to‑real to close the domain gap.
Bringing VLA to embedded is not magic: it’s a compromise between clean data, architectural design and hardware‑aware optimization. If you do it right, your robot stops waiting and starts acting more fluidly.