Nemotron OCR v2: Fast Multilingual OCR with Synthetic Data
Nemotron OCR v2 demonstrates something many of us suspected but few had measured: with enough realistic synthetic data you can train a multilingual OCR that is both accurate and fast. How did they do it? By combining a rendered-data farm with an architecture that reuses features to avoid redundant work.
What they achieved and why it matters
They built a multilingual OCR model (English, Chinese, Japanese, Korean, Russian) that reaches near-zero errors on synthetic tests and processes 34.7 pages per second on a single A100 GPU. That’s not speed for speed’s sake: it means cheaper production pipelines and real-time responses for apps that need to read documents in many languages without pre-detecting the language.
The main lever was massive use of synthetic data: 12.2 million pages generated with pixel-perfect annotation (word-, line-, and paragraph-level boxes, plus reading-order graphs). The public dataset is nvidia/OCR-Synthetic-Multilingual-v1 and the model is available as nvidia/nemotron-ocr-v2.
Synthetic data: the recipe and why it works
Why synthetic data and not just scraping or human labeling? Because programmatic rendering gives you scale and exact labels. Every box, transcription, and reading order is known precisely: you put it there.
Key ingredients:
Text source: they use mOSCAR to sample texts with realistic per-language distributions.
Font set: between 165 and 1,258 fonts per language (Google Fonts, Noto, etc.).
Rendering engine: a heavily modified version of SynthDoG with extensions for multi-structure layouts.
Extensive augmentations: glyph effects, dilation/erosion, blur, color shifts, shadows, background textures and more.
Central advantage: you can control which edge cases appear (vertical columns, tables, slides, scattered text) and produce millions of examples per day on a single machine.
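The "you put it there" point can be made concrete with a toy layout loop. This is an illustrative sketch, not the actual rendering pipeline (which uses real fonts and heavy augmentations): because the code places every word itself, the boxes and reading order fall out for free.

```python
# Minimal sketch of why programmatic rendering yields pixel-perfect labels.
# Layout constants (page width, glyph size) are arbitrary for illustration.
def render_synthetic_page(words, page_width=800, char_w=10, line_h=24, margin=20):
    """Lay out words left-to-right with line wrapping; every box and the
    reading order are known exactly because we placed the text ourselves."""
    x, y = margin, margin
    annotations = []
    for order, word in enumerate(words):
        w = len(word) * char_w
        if x + w > page_width - margin:  # wrap to the next line
            x, y = margin, y + line_h
        # 4-point quad (clockwise from top-left), as in the dataset's schema
        quad = [(x, y), (x + w, y), (x + w, y + line_h), (x, y + line_h)]
        annotations.append({"text": word, "quad": quad, "reading_order": order})
        x += w + char_w  # advance past the word plus one space
    return annotations

ann = render_synthetic_page(["synthetic", "data", "gives", "free", "labels"])
```

A human annotator would have to draw each of those quads by hand; the renderer emits them as a by-product, which is what makes millions of pages per day feasible.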
Important extensions over SynthDoG
Multi-level annotations: word-, line- and paragraph-level boxes, with 4-point quads and hierarchical relations.
Relationship graph: explicit reading order between lines and paragraphs, crucial for multi-column documents or tables.
Varied layout modes: multi-column, vertical columns (important for Japanese and Chinese), tables, slides, etc.
Line-level recognition for CJK: avoids word segmentation when spaces aren’t consistent.
All of this reduces the classic bottleneck in multilingual OCR: it wasn’t the architecture, it was the lack of representative data.
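To make the multi-level annotations and the relationship graph concrete, here is a hypothetical schema sketch. The class and field names are illustrative, not the dataset's actual keys; the point is that reading order is stored as explicit edges between lines, so multi-column pages linearize correctly.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """One annotated region (word, line, or paragraph)."""
    id: int
    quad: list                  # 4-point polygon [(x, y), ...]
    text: str = ""
    children: list = field(default_factory=list)  # e.g. word ids inside a line

@dataclass
class PageGraph:
    """Hierarchical annotations plus an explicit reading-order graph."""
    words: dict
    lines: dict
    paragraphs: dict
    reading_order: list         # directed edges (from_line_id, to_line_id)

    def linearize(self, start):
        """Follow reading-order edges to recover the text sequence."""
        nxt = dict(self.reading_order)
        out, cur = [], start
        while cur is not None:
            out.append(self.lines[cur].text)
            cur = nxt.get(cur)
        return " ".join(out)

# In a two-column page, visual position and logical order diverge:
lines = {0: Region(0, [], "first"), 1: Region(1, [], "second"),
         2: Region(2, [], "third")}
page = PageGraph({}, lines, {}, reading_order=[(0, 2), (2, 1)])
```

Here `page.linearize(0)` yields "first third second": the graph, not the line ids, decides the order, which is exactly what a multi-column or table layout needs.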
The dataset in numbers
The dataset contains 12,258,146 samples across six language/script variants (Chinese is split into Simplified and Traditional). Breakdown:
| Language | Total Samples | Train | Test | Validation |
|---|---|---|---|---|
| English | 1,825,089 | 1,460,304 | 183,629 | 181,156 |
| Japanese | 1,889,137 | 1,502,712 | 193,779 | 192,646 |
| Korean | 2,269,540 | 1,814,994 | 227,091 | 227,455 |
| Russian | 1,724,733 | 1,380,404 | 171,678 | 172,651 |
| Chinese (Simplified) | 2,335,343 | 1,914,948 | 210,143 | 210,252 |
| Chinese (Traditional) | 2,214,304 | 1,772,280 | 221,867 | 220,157 |
| Total | 12,258,146 | 9,845,642 | 1,208,187 | 1,204,317 |
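As a quick sanity check on the figures above, the per-language splits sum to the published totals and imply roughly an 80/10/10 train/test/validation split:

```python
# Per-language (train, test, validation) counts from the table above.
splits = {
    "English":               (1_460_304, 183_629, 181_156),
    "Japanese":              (1_502_712, 193_779, 192_646),
    "Korean":                (1_814_994, 227_091, 227_455),
    "Russian":               (1_380_404, 171_678, 172_651),
    "Chinese (Simplified)":  (1_914_948, 210_143, 210_252),
    "Chinese (Traditional)": (1_772_280, 221_867, 220_157),
}
totals = {lang: sum(s) for lang, s in splits.items()}
grand_total = sum(totals.values())                            # 12,258,146
train_fraction = sum(s[0] for s in splits.values()) / grand_total  # ~0.80
```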
Architecture: designed for speed and structure
Nemotron OCR v2 uses a three-component end-to-end architecture:
Text Detector with RegNetX-8GF backbone.
Text Recognizer: a small pre-norm Transformer that decodes rectified crops.
Relational Model: a compact Transformer that predicts logical groupings and reading order.
The idea is simple and efficient: the expensive convolution runs once and its feature maps are reused by the detector, recognizer and relational model. That feature reuse is what enables the efficiency jump: 34.7 pages/s on A100.
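The feature-reuse pattern can be sketched structurally. This is a stand-in with dummy operations, not the real model (which uses a RegNetX-8GF backbone and Transformer heads); it only shows the control flow: the expensive backbone runs exactly once per page, and all three heads read the same feature map.

```python
class SharedBackboneOCR:
    """Illustrative skeleton: one backbone pass, three heads sharing it."""

    def __init__(self):
        self.backbone_calls = 0

    def backbone(self, image):
        self.backbone_calls += 1                 # expensive conv: count it
        return {"features": f"fmap({image})"}    # stand-in for feature maps

    def detect(self, fmap):
        return ["box0", "box1"]                  # stand-in for text regions

    def recognize(self, fmap, boxes):
        return [f"text@{b}" for b in boxes]      # decode rectified crops

    def relate(self, fmap, boxes):
        return list(zip(boxes, boxes[1:]))       # reading-order edges

    def __call__(self, image):
        fmap = self.backbone(image)              # runs once per page
        boxes = self.detect(fmap)
        texts = self.recognize(fmap, boxes)
        order = self.relate(fmap, boxes)
        return boxes, texts, order

ocr = SharedBackboneOCR()
boxes, texts, order = ocr("page.png")
```

A pipeline of separate models would rerun convolutional feature extraction for each stage; sharing `fmap` across the three heads is the design choice behind the throughput numbers.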
Two available variants:
| Variant | Languages | Region level | Recognizer layers | Charset | Parameters |
|---|---|---|---|---|---|
| v2_english | English | Word | 3 | 855 | 54M |
| v2_multilingual | EN, ZH, JA, KO, RU | Line | 6 | 14,244 | 84M |
Note: the multilingual recognizer is heavier because it handles a 14,244-token vocabulary and processes full lines, which reduces the need to segment in languages without spaces but affects throughput on very dense pages.
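A rough back-of-envelope calculation shows why the charset matters: the recognizer's final output projection scales linearly with vocabulary size. The hidden width below is an assumed placeholder, not a published figure.

```python
# Cost of the decoder's output projection per decoding step, per token.
hidden = 512                       # hypothetical decoder width, illustration only
flops_en = 2 * hidden * 855        # multiply-adds, English variant (855 charset)
flops_multi = 2 * hidden * 14_244  # multilingual variant (14,244 charset)
ratio = flops_multi / flops_en     # how much more work the large charset costs
```

Whatever the true hidden width, the ratio is fixed at 14,244 / 855 ≈ 16.7x, which together with the deeper (6 vs 3 layer) recognizer explains the throughput gap on text-dense pages.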
Results: accuracy and speed
SynthDoG (synthetic pages) NED, where NED is normalized edit distance (lower is better): the multilingual model reduces NED to very low levels compared with Nemotron v1 and with specialized variants from other frameworks.
| Language | PaddleOCR (base) | PaddleOCR (specialized) | OpenOCR (server) | Nemotron OCR v1 | Nemotron OCR v2 (multi) |
|---|---|---|---|---|---|
| English | 0.117 | 0.096 | 0.105 | 0.078 | 0.069 |
| Japanese | 0.201 | 0.201 | 0.586 | 0.723 | 0.046 |
| Korean | 0.943 | 0.133 | 0.837 | 0.923 | 0.047 |
| Russian | 0.959 | 0.163 | 0.950 | 0.564 | 0.043 |
| Chinese (Simplified) | 0.054 | 0.054 | 0.061 | 0.784 | 0.035 |
| Chinese (Traditional) | 0.094 | 0.094 | 0.127 | 0.700 | 0.065 |
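For reference, a common way to compute NED is Levenshtein distance divided by the length of the longer string; the benchmark may normalize slightly differently, so treat this as a sketch of the metric rather than the exact evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, truth: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means a perfect match."""
    if not pred and not truth:
        return 0.0
    return levenshtein(pred, truth) / max(len(pred), len(truth))
```

So a NED of 0.046 on Japanese means roughly one edit per ~22 characters of output, averaged over the test pages.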
OmniDocBench (real-world documents) shows a clear tradeoff between speed and accuracy on mixed-language workloads:
| Model | pages/s | EN | ZH | Mixed |
|---|---|---|---|---|
| PaddleOCR v5 (server) | 1.2 | 0.027 | 0.037 | 0.041 |
| OpenOCR (server) | 1.5 | 0.024 | 0.033 | 0.049 |
| Nemotron OCR v2 (multi) | 34.7 | 0.048 | 0.072 | 0.142 |
| Nemotron OCR v2 (EN) | 40.7 | 0.038 | 0.830 | 0.437 |
| Nemotron OCR v1 | 39.3 | 0.038 | 0.876 | 0.436 |
| EasyOCR | 0.4 | 0.095 | 0.117 | 0.326 |
The table makes a practical point clear: if your workload is multilingual and you don’t want to detect language first, the multilingual variant offers better overall results and a huge speedup compared to pipeline solutions that run separate detectors and recognizers.
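The operational impact of the throughput gap is easy to quantify with the figures from the table above (single-GPU, back-of-envelope):

```python
# Wall-clock time to OCR one million pages on a single A100,
# using throughput numbers from the benchmark table.
pages = 1_000_000
hours_nemotron = pages / 34.7 / 3600   # Nemotron OCR v2 (multi): ~8 hours
hours_pipeline = pages / 1.5 / 3600    # OpenOCR (server): ~185 hours
speedup = hours_pipeline / hours_nemotron
```

Roughly a day-scale job versus a week-scale one on the same hardware, which is what "cheaper production pipelines" means in practice.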
Tradeoffs and design decisions
Line-level recognition for CJK avoids word segmentation and improves robustness for Japanese and Chinese.
A very large vocabulary (14,244 tokens) requires a deeper recognizer and reduces throughput on very text-dense pages.
Reusing the backbone reduces latency, but requires care in how regions are extracted and rectified for the recognizer.
If you work on product, weigh multilingual support against pages-per-second. For single-language documents with little text, the English variant may be more efficient; for mixed or international flows, the multilingual model simplifies operations.
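That balance can be captured as a trivial routing policy. The variant names match the table earlier in the post; the policy itself is a hypothetical sketch, not something the release ships:

```python
from typing import Optional

def choose_variant(doc_language: Optional[str]) -> str:
    """Pick an OCR variant; doc_language is None when unknown or mixed."""
    if doc_language == "en":
        return "v2_english"        # word-level, faster on English-only pages
    return "v2_multilingual"       # line-level, robust across EN/ZH/JA/KO/RU

variant = choose_variant(None)     # unknown language -> multilingual
```

The key property is that the fallback is the multilingual model, since the benchmark shows the English-only variant degrades sharply on ZH and mixed workloads.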
License, demos, and how to try it
Model: nvidia/nemotron-ocr-v2 (NVIDIA Open Model License)
Demo: Nemotron OCR v2 Space on Hugging Face to try it in your browser
This makes integration and quick evaluation in your data flow straightforward.
In the end, the most interesting thing isn’t just the pages-per-second number or a NED table: it’s the demonstration that, with a robust renderer and good augmentations, synthetic data stops being an approximation and becomes a practical solution to scale OCR across many languages without the prohibitive cost of human annotation.