Hugging Face has just published a practical guide to speeding up OCR projects with open models. Have you ever wondered what vision-language models can do today with complex documents, tables, or handwriting? The guide walks you step by step from choosing a model to running it, and explains practical advantages like cost and privacy.
What the guide includes and why it matters
The piece was published on October 21, 2025 and works as a quick map of the current ecosystem: transcription capabilities, output formats (HTML, Markdown, DocTags, JSON), highlighted models, and tools to run inference both locally and in the cloud. It's useful if you work with invoices, historical archives, or want to turn large repositories of PDFs into usable data. (huggingface.co)
Real capabilities of modern models
Today's models don't just convert printed text into digital text. They handle handwriting, multiple scripts, and equations and formulas, and they can recognize tables and charts and insert image descriptions where appropriate. Practical result? Less post-processing and fewer manual rules to reconstruct the document structure. (huggingface.co)
If your goal is to feed an LLM to ask questions about a document, preferring models that output Markdown with image captions saves you a lot of work.
Open models worth looking at (summary)
Hugging Face collects several standout models and compares outputs and sizes. Some relevant examples:
- Nanonets-OCR2-3B: structured output in Markdown and HTML tables, handles signatures and checkboxes.
- PaddleOCR-VL: 0.9B parameters, support for 109 languages and prompting capability.
- OlmOCR: optimized for large-scale batch processing.
- Granite-Docling (258M): uses DocTags and allows changing tasks via prompts.
Each model has different strengths depending on output format and end use: digital reconstruction, input for an LLM, or programmatic analysis. It's worth reviewing each model's card before deciding. (huggingface.co)
How to choose and evaluate — benchmarks and economics
There is no universal model. Hugging Face recommends using benchmarks like OmniDocBench or OlmOCR-Bench, but also stresses that your documents may not be well represented in those tests. Practical recommendation? Collect a representative sample from your own domain and compare several models on it.
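As a concrete way to run that comparison, here is a minimal sketch (my own, not from the guide) that scores each model's transcriptions against human-verified references using character error rate via the jiwer library; the directory layout and model names are placeholders.

```python
from pathlib import Path

import jiwer  # pip install jiwer

# Hypothetical layout: gold/ holds human-verified transcriptions,
# outputs/<model>/ holds each model's transcription of the same pages.
gold_dir = Path("gold")
model_dirs = {
    "nanonets-ocr2-3b": Path("outputs/nanonets"),
    "paddleocr-vl": Path("outputs/paddleocr"),
}

for model_name, out_dir in model_dirs.items():
    cers = []
    for gold_file in gold_dir.glob("*.txt"):
        reference = gold_file.read_text(encoding="utf-8")
        hypothesis = (out_dir / gold_file.name).read_text(encoding="utf-8")
        cers.append(jiwer.cer(reference, hypothesis))  # character error rate, lower is better
    print(f"{model_name}: mean CER = {sum(cers) / len(cers):.3f}")
```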
On cost, the authors show approximate calculations: open models in the 3B to 7B parameter range are common, and according to the guide the cost per million pages falls in comparable ranges across models when you run them on optimized instances; OlmOCR, for example, is cited with a reference cost per million pages under specific conditions. Beyond accuracy, then, you should also weigh the availability of optimized implementations and quantization options to keep costs down. (huggingface.co)
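To make the economics concrete, a back-of-the-envelope version of that calculation looks like this; the instance price and throughput below are placeholder numbers, not figures from the guide.

```python
# Placeholder inputs: replace with the real hourly price of your GPU instance
# and the throughput you measure for a given model on your own documents.
instance_cost_per_hour = 2.0   # USD/hour (hypothetical)
pages_per_hour = 5_000         # measured throughput (hypothetical)

cost_per_page = instance_cost_per_hour / pages_per_hour
cost_per_million_pages = cost_per_page * 1_000_000
print(f"~${cost_per_million_pages:,.0f} per million pages")  # -> ~$400 with these numbers
```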
How to run the models: local and remote
The guide shows practical paths:
- Locally with vLLM, or with transformers for direct model loading. A simple example to serve a model locally:
vllm serve nanonets/Nanonets-OCR2-3B
From there, the guide shows an inference flow in Python with an OpenAI-style client that sends images as base64 (see the sketch after this list). There are also examples using the transformers API and AutoModelForImageTextToText to generate output in HTML or Markdown. (huggingface.co)
- Remotely using Hugging Face Inference Endpoints for managed deployment, or Hugging Face Jobs along with ready scripts to process batches of images without managing infrastructure. This makes it easier to go from prototype to production when you need to process thousands or millions of pages. (huggingface.co)
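To illustrate the local path, here is a minimal sketch of that OpenAI-style client flow against the vllm serve command above; the prompt, file name, and port are my assumptions (vLLM's OpenAI-compatible server listens on port 8000 by default), so check the model card for the recommended prompt.

```python
import base64

from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server started with:
#   vllm serve nanonets/Nanonets-OCR2-3B
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a page image as base64 and send it as a data URL (hypothetical file name).
with open("invoice_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nanonets/Nanonets-OCR2-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Transcribe this page to Markdown, preserving tables."},
            ],
        }
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

The same client pattern usually carries over to the remote path: Inference Endpoints running vLLM or TGI expose an OpenAI-compatible API, so in many cases you only swap the base URL for your endpoint URL and pass your Hugging Face token as the API key.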
Beyond OCR: visual retrieval and document QA
If your goal isn't just to extract text, the guide explains how to build multimodal pipelines: index PDFs with visual retrievers and combine them with VLMs to answer questions directly about the documents. Why is this useful? Because it avoids the context loss that comes from converting everything to plain text before asking an LLM. (huggingface.co)
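As one possible shape for the retrieval half of such a pipeline, the sketch below uses the colpali-engine library, a visual document retriever in the family the guide refers to; the checkpoint, file names, and exact API calls reflect that library's documentation as I understand it, so treat this as a starting point rather than the guide's own code.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor  # pip install colpali-engine

model_name = "vidore/colpali-v1.2"  # hypothetical choice of checkpoint
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="cuda").eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Page images rendered from your PDFs (hypothetical file names) and a user question.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What is the invoice total?"]

batch_images = processor.process_images(pages).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores: higher means the page is more relevant to the query.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = pages[scores.argmax(dim=1).item()]
# Pass best_page plus the question to a VLM, e.g. via the client shown earlier.
```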
Practical recommendations and risks to consider
- Try several models with a real sample of your documents before choosing one.
- If you need privacy or cost savings at scale, open-weight models are often cheaper to run and more transparent than closed options.
- Be careful with the quality of evaluation datasets: many benchmarks use transcriptions generated by models or automated pipelines, not only human annotations. That's why validating with human data representative of your domain is key. (huggingface.co)
To dive deeper
If you want to read the full guide and try the demos they mention, the original post is on the Hugging Face site: Supercharge your OCR Pipelines with Open Models. They also include links to demos, benchmarks and ready-to-use scripts. (huggingface.co)
Think of this as a practical invitation: the technology is no longer only for labs, it's ready to be integrated into real business and research processes. Ready to try it with your own documents?
