The good news: you can now run a lightweight Vision Language Model (VLM) on a laptop or PC with an Intel CPU without needing a huge GPU farm. Sounds good, right? In this guide I walk you, step by step and without inaccessible jargon, through converting, quantizing, and running a VLM using Optimum Intel and OpenVINO. (huggingface.co)
What it is and why you should care
VLMs are models that combine vision and language to describe images, generate captions, or answer questions about visual content. Running them locally gives you two clear advantages: your data stays on your machine and latency is usually much lower than relying on a remote server. In this case the recipe uses SmolVLM, a small model made for limited resources, together with Optimum Intel and OpenVINO to optimize deployment. (huggingface.co)
How to do it in 3 steps
The core idea is simple: convert, quantize, and run. Each step includes concrete commands you can copy and paste.
- Step 0: install what you need
To follow the tutorial you need optimum and openvino. The recommended install is:
pip install optimum-intel[openvino] transformers==4.52.*
Running this gives you the tools to convert Hugging Face models to OpenVINO and apply optimizations. (huggingface.co)
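To make sure the install worked, you can import both packages and print the OpenVINO version; a minimal sketch (the exact version string depends on what pip resolved):
# Quick sanity check that the packages are importable.
import openvino as ov
import optimum.intel

print("OpenVINO:", ov.get_version())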
- Step 1: convert your model to OpenVINO
You have two options: use the CLI or convert when loading the model from code. For example, with the CLI:
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/
Or in Python when loading the model:
from optimum.intel import OVModelForVisualCausalLM

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
model = OVModelForVisualCausalLM.from_pretrained(model_id)
model.save_pretrained("smolvlm_ov")
Converting to OpenVINO generates the IR format optimized for Intel hardware. (huggingface.co)
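Once that folder exists you can reload the exported model straight from disk; a minimal sketch (no re-conversion happens, because the directory already contains the OpenVINO files):
from optimum.intel import OVModelForVisualCausalLM

# Loads the IR files saved in the previous step from the local folder.
model = OVModelForVisualCausalLM.from_pretrained("smolvlm_ov")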
- Step 2: quantization
Quantization reduces the precision of weights or activations to speed up inference and reduce memory. Optimum supports two main methods:
- Weight Only Quantization (WOQ): only the weights are converted to lower precision; it’s fast and tends to preserve accuracy better.
- Static Quantization: quantizes weights and activations, and requires calibration with a representative sample set.
An example with WOQ in Python:
from optimum.intel import OVWeightQuantizationConfig, OVModelForVisualCausalLM

q_config = OVWeightQuantizationConfig(bits=8)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_int8")
And for a mixed setup where the vision part receives static quantization:
from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

# dataset and num_samples point to the representative calibration set needed
# for the statically quantized vision encoder.
q_config = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVWeightQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)
It’s important to test accuracy after quantizing; the speedup can come with a small drop in fidelity. (huggingface.co)
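A quick sanity check is to compare the quantized model's answer with the original model's answer on the same input; a rough sketch, assuming inputs and processor are built as shown in Step 3 below and model is the unquantized OpenVINO model from Step 1:
# Generate with both variants on the same inputs and eyeball the difference.
ref_ids = model.generate(**inputs, max_new_tokens=100)
int8_ids = q_model.generate(**inputs, max_new_tokens=100)

ref_text = processor.batch_decode(ref_ids, skip_special_tokens=True)[0]
int8_text = processor.batch_decode(int8_ids, skip_special_tokens=True)[0]

# Outputs are rarely identical token for token; what matters is whether the
# quantized answer is still faithful to the image.
print("Original :", ref_text)
print("Quantized:", int8_text)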
- Step 3: run inference
Once quantized, running inference is straightforward:
generated_ids = q_model.generate(**inputs, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
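For that snippet to run you first need processor and inputs. A minimal sketch of how they might be built, assuming the chat-template interface that recent transformers versions expose for SmolVLM2 (my_image.jpg and the question are placeholders):
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# A single user turn containing an image plus a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "my_image.jpg"},  # placeholder: point to your own image
            {"type": "text", "text": "Can you describe this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)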
If you have a recent Intel laptop or a discrete Intel GPU, you can load the model with device="gpu" to take advantage of it. There's also a Space demo where you can try the quantized variants. (huggingface.co)
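For example, loading the quantized model onto an Intel GPU might look like the sketch below (it assumes the OpenVINO GPU plugin detects a compatible device; the default target is the CPU):
from optimum.intel import OVModelForVisualCausalLM

# "gpu" selects the Intel GPU plugin instead of the default CPU device.
q_model = OVModelForVisualCausalLM.from_pretrained("smolvlm_int8", device="gpu")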
The results they report and what they mean for you
In the tests they published, converting to OpenVINO drastically reduces latency compared to the PyTorch version. Some key numbers on Intel CPU were:
- PyTorch: TTFT (time to first token) ~ 5.15 s, throughput ~ 0.72 tokens/s.
- OpenVINO: TTFT ~ 0.42 s, throughput ~ 47 tokens/s.
- OpenVINO with 8-bit WOQ: TTFT ~ 0.247 s, throughput ~ 63.9 tokens/s.
In other words, in their tests moving to OpenVINO cut time-to-first-token by roughly 12x (over 20x with 8-bit WOQ) and multiplied throughput by more than 60x. These results were obtained on a specific setup with Intel CPUs and OpenVINO 2025.2.0; your experience will vary with hardware and settings. (huggingface.co)
Converting and quantizing can turn a small model into something usable on a normal laptop without expensive GPUs. (huggingface.co)
Practical recommendations before you start
- If you want to experiment quickly, start with weight-only quantization; it's usually the lowest risk for quality.
- If your use case processes many images or you want maximum efficiency in the vision encoder, try static quantization for that piece and measure the difference.
- Always validate results and create a representative test set for calibration and evaluation after quantizing.
- Check the notebook and the Space that accompany the tutorial for ready-to-run examples. (huggingface.co)
To wrap up
You don’t need giant infrastructure to try multimodal capabilities today. With small models like SmolVLM and tools like Optimum Intel and OpenVINO you can bring a VLM to a local machine and get latencies useful for prototypes or private apps. Want me to guide you with commands adapted to your machine or a practical example for your CPU? I can help put together the concrete steps for your laptop.