Nemotron OCR v2: Fast Multilingual OCR with Synthetic Data
Nemotron OCR v2 demonstrates something many of us suspected but few had measured: with enough realistic synthetic data you can train a multilingual OCR that is both accurate and fast. How did they do it? By combining a rendered-data farm with an architecture that reuses features to avoid redundant work.
What they achieved and why it matters
They built a multilingual OCR model (English, Chinese, Japanese, Korean, Russian) that reaches near-zero errors on synthetic tests and processes 34.7 pages per second on a single A100 GPU. That’s not speed for speed’s sake: it means cheaper production pipelines and real-time responses for apps that need to read documents in many languages without pre-detecting the language.
The main lever was massive use of synthetic data: 12.2 million pages generated with pixel-perfect annotation (word-, line-, and paragraph-level boxes, plus reading-order graphs). The public dataset is nvidia/OCR-Synthetic-Multilingual-v1 and the model is available as nvidia/nemotron-ocr-v2.
Synthetic data: the recipe and why it works
Why synthetic data and not just scraping or human labeling? Because programmatic rendering gives you scale and exact labels. Every box, transcription, and reading order is known precisely: you put it there.
Key ingredients:
Text source: they use mOSCAR to sample texts with realistic per-language distributions.
Font set: between 165 and 1,258 fonts per language (Google Fonts, Noto, etc.).
Rendering engine: a heavily modified version of SynthDoG with extensions for multi-structure layouts.
Extensive augmentations: glyph effects, dilation/erosion, blur, color shifts, shadows, background textures and more.
Central advantage: you can control which edge cases appear (vertical columns, tables, slides, scattered text) and produce millions of examples per day on a single machine.
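The "you put it there" point can be made concrete with a toy layout loop. This is an illustrative sketch, not the actual rendering pipeline (which uses real fonts and heavy augmentations): because the code places every word itself, the boxes and reading order fall out for free.

```python
# Minimal sketch of why programmatic rendering yields pixel-perfect labels.
# Layout constants (page width, glyph size) are arbitrary for illustration.
def render_synthetic_page(words, page_width=800, char_w=10, line_h=24, margin=20):
    """Lay out words left-to-right with line wrapping; every box and the
    reading order are known exactly because we placed the text ourselves."""
    x, y = margin, margin
    annotations = []
    for order, word in enumerate(words):
        w = len(word) * char_w
        if x + w > page_width - margin:  # wrap to the next line
            x, y = margin, y + line_h
        # 4-point quad (clockwise from top-left), as in the dataset's schema
        quad = [(x, y), (x + w, y), (x + w, y + line_h), (x, y + line_h)]
        annotations.append({"text": word, "quad": quad, "reading_order": order})
        x += w + char_w  # advance past the word plus one space
    return annotations

ann = render_synthetic_page(["synthetic", "data", "gives", "free", "labels"])
```

A human annotator would have to draw each of those quads by hand; the renderer emits them as a by-product, which is what makes millions of pages per day feasible.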
Important extensions over SynthDoG
Multi-level annotations: word-, line- and paragraph-level boxes, with 4-point quads and hierarchical relations.
Relationship graph: explicit reading order between lines and paragraphs, crucial for multi-column documents or tables.
Varied layout modes: multi-column, vertical columns (important for Japanese and Chinese), tables, slides, etc.
Line-level recognition for CJK: avoids word segmentation when spaces aren’t consistent.
All of this reduces the classic bottleneck in multilingual OCR: it wasn’t the architecture, it was the lack of representative data.
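To make the multi-level annotations and the relationship graph concrete, here is a hypothetical schema sketch. The class and field names are illustrative, not the dataset's actual keys; the point is that reading order is stored as explicit edges between lines, so multi-column pages linearize correctly.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """One annotated region (word, line, or paragraph)."""
    id: int
    quad: list                  # 4-point polygon [(x, y), ...]
    text: str = ""
    children: list = field(default_factory=list)  # e.g. word ids inside a line

@dataclass
class PageGraph:
    """Hierarchical annotations plus an explicit reading-order graph."""
    words: dict
    lines: dict
    paragraphs: dict
    reading_order: list         # directed edges (from_line_id, to_line_id)

    def linearize(self, start):
        """Follow reading-order edges to recover the text sequence."""
        nxt = dict(self.reading_order)
        out, cur = [], start
        while cur is not None:
            out.append(self.lines[cur].text)
            cur = nxt.get(cur)
        return " ".join(out)

# In a two-column page, visual position and logical order diverge:
lines = {0: Region(0, [], "first"), 1: Region(1, [], "second"),
         2: Region(2, [], "third")}
page = PageGraph({}, lines, {}, reading_order=[(0, 2), (2, 1)])
```

Here `page.linearize(0)` yields "first third second": the graph, not the line ids, decides the order, which is exactly what a multi-column or table layout needs.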
The dataset in numbers
The dataset contains 12,258,146 samples across six language/script variants (Chinese is split into Simplified and Traditional). Breakdown:
| Language | Total Samples | Train | Test | Validation |
|---|---|---|---|---|
| English | 1,825,089 | 1,460,304 | 183,629 | 181,156 |
| Japanese | 1,889,137 | 1,502,712 | 193,779 | 192,646 |
| Korean | 2,269,540 | 1,814,994 | 227,091 | 227,455 |
| Russian | 1,724,733 | 1,380,404 | 171,678 | 172,651 |
| Chinese (Simplified) | 2,335,343 | 1,914,948 | 210,143 | 210,252 |
| Chinese (Traditional) | 2,214,304 | 1,772,280 | 221,867 | 220,157 |
| Total | 12,258,146 | 9,845,642 | 1,208,187 | 1,204,317 |
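As a quick sanity check on the figures above, the per-language splits sum to the published totals and imply roughly an 80/10/10 train/test/validation split:

```python
# Per-language (train, test, validation) counts from the table above.
splits = {
    "English":               (1_460_304, 183_629, 181_156),
    "Japanese":              (1_502_712, 193_779, 192_646),
    "Korean":                (1_814_994, 227_091, 227_455),
    "Russian":               (1_380_404, 171_678, 172_651),
    "Chinese (Simplified)":  (1_914_948, 210_143, 210_252),
    "Chinese (Traditional)": (1_772_280, 221_867, 220_157),
}
totals = {lang: sum(s) for lang, s in splits.items()}
grand_total = sum(totals.values())                            # 12,258,146
train_fraction = sum(s[0] for s in splits.values()) / grand_total  # ~0.80
```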
Architecture: designed for speed and structure
Nemotron OCR v2 uses a three-component end-to-end architecture:
Text Detector with RegNetX-8GF backbone.
Text Recognizer: a small pre-norm Transformer that decodes rectified crops.
Relational Model: a compact Transformer that predicts logical groupings and reading order.
The idea is simple and efficient: the expensive convolution runs once and its feature maps are reused by the detector, recognizer and relational model. That feature reuse is what enables the efficiency jump: 34.7 pages/s on A100.
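The feature-reuse pattern can be sketched structurally. This is a stand-in with dummy operations, not the real model (which uses a RegNetX-8GF backbone and Transformer heads); it only shows the control flow: the expensive backbone runs exactly once per page, and all three heads read the same feature map.

```python
class SharedBackboneOCR:
    """Illustrative skeleton: one backbone pass, three heads sharing it."""

    def __init__(self):
        self.backbone_calls = 0

    def backbone(self, image):
        self.backbone_calls += 1                 # expensive conv: count it
        return {"features": f"fmap({image})"}    # stand-in for feature maps

    def detect(self, fmap):
        return ["box0", "box1"]                  # stand-in for text regions

    def recognize(self, fmap, boxes):
        return [f"text@{b}" for b in boxes]      # decode rectified crops

    def relate(self, fmap, boxes):
        return list(zip(boxes, boxes[1:]))       # reading-order edges

    def __call__(self, image):
        fmap = self.backbone(image)              # runs once per page
        boxes = self.detect(fmap)
        texts = self.recognize(fmap, boxes)
        order = self.relate(fmap, boxes)
        return boxes, texts, order

ocr = SharedBackboneOCR()
boxes, texts, order = ocr("page.png")
```

A pipeline of separate models would rerun convolutional feature extraction for each stage; sharing `fmap` across the three heads is the design choice behind the throughput numbers.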
Two available variants:
| Variant | Languages | Region level | Recognizer layers | Charset | Parameters |
|---|---|---|---|---|---|
| v2_english | English | Word | 3 | 855 | 54M |
| v2_multilingual | EN, ZH, JA, KO, RU | Line | 6 | 14,244 | 84M |
Note: the multilingual recognizer is heavier because it handles a 14,244-token vocabulary and processes full lines, which reduces the need to segment in languages without spaces but affects throughput on very dense pages.
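A rough back-of-envelope calculation shows why the charset matters: the recognizer's final output projection scales linearly with vocabulary size. The hidden width below is an assumed placeholder, not a published figure.

```python
# Cost of the decoder's output projection per decoding step, per token.
hidden = 512                       # hypothetical decoder width, illustration only
flops_en = 2 * hidden * 855        # multiply-adds, English variant (855 charset)
flops_multi = 2 * hidden * 14_244  # multilingual variant (14,244 charset)
ratio = flops_multi / flops_en     # how much more work the large charset costs
```

Whatever the true hidden width, the ratio is fixed at 14,244 / 855 ≈ 16.7x, which together with the deeper (6 vs 3 layer) recognizer explains the throughput gap on text-dense pages.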
Results: accuracy and speed
SynthDoG (synthetic pages) NED, where NED is normalized edit distance (lower is better): the multilingual model reduces NED to very low levels compared with Nemotron v1 and with specialized variants from other frameworks.
| Language | PaddleOCR (base) | PaddleOCR (specialized) | OpenOCR (server) | Nemotron OCR v1 | Nemotron OCR v2 (multi) |
|---|---|---|---|---|---|
| English | 0.117 | 0.096 | 0.105 | 0.078 | 0.069 |
| Japanese | 0.201 | 0.201 | 0.586 | 0.723 | 0.046 |
| Korean | 0.943 | 0.133 | 0.837 | 0.923 | 0.047 |
| Russian | 0.959 | 0.163 | 0.950 | 0.564 | 0.043 |
| Chinese (Simplified) | 0.054 | 0.054 | 0.061 | 0.784 | 0.035 |
| Chinese (Traditional) | 0.094 | 0.094 | 0.127 | 0.700 | 0.065 |
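For reference, a common way to compute NED is Levenshtein distance divided by the length of the longer string; the benchmark may normalize slightly differently, so treat this as a sketch of the metric rather than the exact evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, truth: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means a perfect match."""
    if not pred and not truth:
        return 0.0
    return levenshtein(pred, truth) / max(len(pred), len(truth))
```

So a NED of 0.046 on Japanese means roughly one edit per ~22 characters of output, averaged over the test pages.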
OmniDocBench (real-world documents) shows a clear tradeoff between speed and accuracy on mixed-language workloads:
| Model | pages/s | EN | ZH | Mixed |
|---|---|---|---|---|
| PaddleOCR v5 (server) | 1.2 | 0.027 | 0.037 | 0.041 |
| OpenOCR (server) | 1.5 | 0.024 | 0.033 | 0.049 |
| Nemotron OCR v2 (multi) | 34.7 | 0.048 | 0.072 | 0.142 |
| Nemotron OCR v2 (EN) | 40.7 | 0.038 | 0.830 | 0.437 |
| Nemotron OCR v1 | 39.3 | 0.038 | 0.876 | 0.436 |
| EasyOCR | 0.4 | 0.095 | 0.117 | 0.326 |
The table makes a practical point clear: if your workload is multilingual and you don’t want to detect language first, the multilingual variant offers better overall results and a huge speedup compared to pipeline solutions that run separate detectors and recognizers.
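The operational impact of the throughput gap is easy to quantify with the figures from the table above (single-GPU, back-of-envelope):

```python
# Wall-clock time to OCR one million pages on a single A100,
# using throughput numbers from the benchmark table.
pages = 1_000_000
hours_nemotron = pages / 34.7 / 3600   # Nemotron OCR v2 (multi): ~8 hours
hours_pipeline = pages / 1.5 / 3600    # OpenOCR (server): ~185 hours
speedup = hours_pipeline / hours_nemotron
```

Roughly a day-scale job versus a week-scale one on the same hardware, which is what "cheaper production pipelines" means in practice.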
Tradeoffs and design decisions
Line-level recognition for CJK avoids word segmentation and improves robustness for Japanese and Chinese.
A very large vocabulary (14,244 tokens) requires a deeper recognizer and reduces throughput on very text-dense pages.
Reusing the backbone reduces latency, but requires care in how regions are extracted and rectified for the recognizer.
If you work on product, weigh multilingual support against pages-per-second. For single-language documents with little text, the English variant may be more efficient; for mixed or international flows, the multilingual model simplifies operations.
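That balance can be captured as a trivial routing policy. The variant names match the table earlier in the post; the policy itself is a hypothetical sketch, not something the release ships:

```python
from typing import Optional

def choose_variant(doc_language: Optional[str]) -> str:
    """Pick an OCR variant; doc_language is None when unknown or mixed."""
    if doc_language == "en":
        return "v2_english"        # word-level, faster on English-only pages
    return "v2_multilingual"       # line-level, robust across EN/ZH/JA/KO/RU

variant = choose_variant(None)     # unknown language -> multilingual
```

The key property is that the fallback is the multilingual model, since the benchmark shows the English-only variant degrades sharply on ZH and mixed workloads.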
License, demos, and how to try it
Model: nvidia/nemotron-ocr-v2 (NVIDIA Open Model License)
Demo: Nemotron OCR v2 Space on Hugging Face to try it in your browser
This makes integration and quick evaluation in your data flow straightforward.
In the end, the most interesting thing isn’t just the pages-per-second number or a NED table: it’s the demonstration that, with a robust renderer and good augmentations, synthetic data stops being an approximation and becomes a practical solution to scale OCR across many languages without the prohibitive cost of human annotation.