The good news: you can now run a lightweight Vision Language Model (VLM) on a laptop or PC with an Intel CPU without needing a huge GPU farm. Sounds good, right? In this guide I walk you, step by step and without inaccessible jargon, through converting, quantizing, and running a VLM using Optimum Intel and OpenVINO. (huggingface.co)
What it is and why you should care
VLMs are models that combine vision and language to describe images, generate captions, or answer questions about visual content. Running them locally gives you two clear advantages: your data stays on your machine and latency is usually much lower than relying on a remote server. In this case the recipe uses SmolVLM, a small model made for limited resources, together with Optimum Intel and OpenVINO to optimize deployment. (huggingface.co)
How to do it in 3 steps
The core idea is simple: convert, quantize, and run. Each step includes concrete commands you can copy and paste.
- Step 0: install what you need
To follow the tutorial you need optimum and openvino. The recommended install is:
pip install optimum-intel[openvino] transformers==4.52.*
Running this gives you the tools to convert Hugging Face models to OpenVINO and apply optimizations. (huggingface.co)
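To make sure the install worked, you can import both packages and print the OpenVINO version; a minimal sketch (the exact version string depends on what pip resolved):
# Quick sanity check that the packages are importable.
import openvino as ov
import optimum.intel

print("OpenVINO:", ov.get_version())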
- Step 1: convert your model to OpenVINO
You have two options: use the CLI or convert when loading the model from code. For example, with the CLI:
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/
Or in Python when loading the model:
from optimum.intel import OVModelForVisualCausalLM

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
model = OVModelForVisualCausalLM.from_pretrained(model_id)
model.save_pretrained("smolvlm_ov")
Converting to OpenVINO generates the IR format optimized for Intel hardware. (huggingface.co)
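Once that folder exists you can reload the exported model straight from disk; a minimal sketch (no re-conversion happens, because the directory already contains the OpenVINO files):
from optimum.intel import OVModelForVisualCausalLM

# Loads the IR files saved in the previous step from the local folder.
model = OVModelForVisualCausalLM.from_pretrained("smolvlm_ov")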
- Step 2: quantization
Quantization reduces the precision of weights or activations to speed up inference and reduce memory. Optimum supports two main methods:
- Weight Only Quantization (WOQ): only the weights are converted to lower precision; it’s fast and tends to preserve accuracy better.
- Static Quantization: quantizes weights and activations, and requires calibration with a representative sample set.
An example with WOQ in Python:
from optimum.intel import OVWeightQuantizationConfig, OVModelForVisualCausalLM

q_config = OVWeightQuantizationConfig(bits=8)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_int8")
And for a mixed setup where the vision part receives static quantization:
from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

# dataset and num_samples point to the representative calibration set needed
# for the statically quantized vision encoder.
q_config = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVWeightQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)
It’s important to test accuracy after quantizing; the speedup can come with a small drop in fidelity. (huggingface.co)
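A quick sanity check is to compare the quantized model's answer with the original model's answer on the same input; a rough sketch, assuming inputs and processor are built as shown in Step 3 below and model is the unquantized OpenVINO model from Step 1:
# Generate with both variants on the same inputs and eyeball the difference.
ref_ids = model.generate(**inputs, max_new_tokens=100)
int8_ids = q_model.generate(**inputs, max_new_tokens=100)

ref_text = processor.batch_decode(ref_ids, skip_special_tokens=True)[0]
int8_text = processor.batch_decode(int8_ids, skip_special_tokens=True)[0]

# Outputs are rarely identical token for token; what matters is whether the
# quantized answer is still faithful to the image.
print("Original :", ref_text)
print("Quantized:", int8_text)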
- Step 3: run inference
Once quantized, running inference is straightforward:
generated_ids = q_model.generate(**inputs, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
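For that snippet to run you first need processor and inputs. A minimal sketch of how they might be built, assuming the chat-template interface that recent transformers versions expose for SmolVLM2 (my_image.jpg and the question are placeholders):
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# A single user turn containing an image plus a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "my_image.jpg"},  # placeholder: point to your own image
            {"type": "text", "text": "Can you describe this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)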
If you have a recent Intel laptop or a discrete Intel GPU, you can load the model with device="gpu" to take advantage of it. There's also a Space demo where you can try the quantized variants. (huggingface.co)
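For example, loading the quantized model onto an Intel GPU might look like the sketch below (it assumes the OpenVINO GPU plugin detects a compatible device; the default target is the CPU):
from optimum.intel import OVModelForVisualCausalLM

# "gpu" selects the Intel GPU plugin instead of the default CPU device.
q_model = OVModelForVisualCausalLM.from_pretrained("smolvlm_int8", device="gpu")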
The results they report and what they mean for you
In the tests they published, converting to OpenVINO drastically reduces latency compared to the PyTorch version. Some key numbers on Intel CPU were:
- PyTorch: TTFT (time to first token) ~ 5.15 s, throughput ~ 0.72 tokens/s.
- OpenVINO: TTFT ~ 0.42 s, throughput ~ 47 tokens/s.
- OpenVINO with 8-bit WOQ: TTFT ~ 0.247 s, throughput ~ 63.9 tokens/s.
In other words, in their tests moving to OpenVINO cut time-to-first-token by roughly 12x (over 20x with 8-bit WOQ) and multiplied throughput by more than 60x. These results were obtained on a specific setup with Intel CPUs and OpenVINO 2025.2.0; your experience will vary with hardware and settings. (huggingface.co)
Converting and quantizing can turn a small model into something usable on a normal laptop without expensive GPUs. (huggingface.co)
Practical recommendations before you start
- If you want to experiment quickly, start with weight-only quantization; it's usually the lowest risk for quality.
- If your use case processes many images or you want maximum efficiency in the vision encoder, try static quantization for that piece and measure the difference.
- Always validate results and create a representative test set for calibration and evaluation after quantizing.
- Check the notebook and the Space that accompany the tutorial for ready-to-run examples. (huggingface.co)
To wrap up
You don’t need giant infrastructure to try multimodal capabilities today. With small models like SmolVLM and tools like Optimum Intel and OpenVINO you can bring a VLM to a local machine and get latencies useful for prototypes or private apps. Want me to guide you with commands adapted to your machine or a practical example for your CPU? I can help put together the concrete steps for your laptop.