olmOCR 2: AI that improves OCR for PDF documents | Keryc
olmOCR 2 arrives as a tool designed to read tricky PDFs and turn them into structured text without leaning on brittle rules. Can you imagine uploading an academic paper full of equations, tables and multiple columns and getting ready-to-use Markdown, HTML and LaTeX back? That's exactly what this release promises.
What is olmOCR 2
olmOCR 2 (olmOCR-2-7B-1025) is Ai2's new document reader that combines vision and language to transcribe complex pages in a single pass. The team presents it as an end-to-end solution that generates structure (headings in Markdown, tables in HTML, equations in LaTeX) directly in the output. (allenai.org)
The innovation: training with unit tests as a reward
The core idea here is training against things you can verify automatically. Instead of optimizing only for a similarity metric, Ai2 turned verifiable properties of the document into unit tests that return pass or fail. Those tests act as rewards during training, so the model learns to produce outputs that are verifiably correct.
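To make the idea of tests as rewards concrete, here is a minimal Python sketch. It is not Ai2's code: the test types (`text_present`, `text_absent`, `appears_before`), the `PageTest` class, and the choice to score a page by the fraction of tests passed are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageTest:
    """One binary check on a transcription (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def text_present(snippet: str) -> Callable[[str], bool]:
    # Passes when the snippet survives in the output.
    return lambda output: snippet in output


def text_absent(snippet: str) -> Callable[[str], bool]:
    # Passes when unwanted text (e.g. a page footer) was dropped.
    return lambda output: snippet not in output


def appears_before(first: str, second: str) -> Callable[[str], bool]:
    # Passes when both snippets exist and keep their reading order.
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def page_reward(output: str, tests: List[PageTest]) -> float:
    # Assumed aggregation: fraction of binary tests that pass.
    if not tests:
        return 0.0
    return sum(1 for t in tests if t.check(output)) / len(tests)


tests = [
    PageTest("title kept", text_present("# Results")),
    PageTest("footer dropped", text_absent("Page 3 of 12")),
    PageTest("reading order", appears_before("Column one", "Column two")),
]

good = "# Results\nColumn one text.\nColumn two text."
bad = "Column two text.\n# Results\nColumn one text.\nPage 3 of 12"

print(page_reward(good, tests))  # all three tests pass
print(page_reward(bad, tests))   # only the title test passes
```

Each check is pass or fail on its own, so a transcription that merely looks similar to the target can still score low if it scrambles the reading order or keeps a footer it should have dropped.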
And how do they generate data with those verifiable properties? They built a synthetic pipeline that takes real pages, re-renders them as semantic HTML (analyzed with Claude Sonnet 4), and derives exact targets and programmatic test cases from that HTML. That process allowed them to create thousands of examples with automatic tests built in to supervise learning. (allenai.org)
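The step from semantic HTML to programmatic test cases could look roughly like the sketch below. It uses only Python's standard `html.parser`. The `StructureExtractor` class, the `derive_test_cases` function, and the two test kinds (`heading_present`, `cell_left_of`) are hypothetical names chosen for illustration, not the pipeline Ai2 published.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects heading text and table rows from semantic HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.rows = []      # one list of cell strings per <tr>
        self._tag = None    # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        if tag in ("h1", "h2", "h3", "td", "th"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag in ("td", "th"):
            if self.rows:
                self.rows[-1].append(text)
        else:
            self.headings.append(text)
        self._tag = None


def derive_test_cases(html: str):
    # Turn known structure into checkable facts about any transcription.
    parser = StructureExtractor()
    parser.feed(html)
    cases = [("heading_present", h) for h in parser.headings]
    for row in parser.rows:
        for left, right in zip(row, row[1:]):
            cases.append(("cell_left_of", left, right))
    return cases


page_html = """
<h1>Yield by year</h1>
<table>
  <tr><th>Year</th><th>Yield</th></tr>
  <tr><td>1901</td><td>4.2</td></tr>
</table>
"""

cases = derive_test_cases(page_html)
for case in cases:
    print(case)
```

Because the HTML is the ground truth for the rendered page, every fact extracted from it (a heading exists, one cell sits to the left of another) becomes a test that needs no human labeling.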
Techniques and architecture in brief
olmOCR 2 is fine-tuned on Qwen2.5-VL-7B using the dataset olmOCR-mix-1025 (about 270,000 pages) and uses an RL algorithm called GRPO to optimize binary rewards from the tests. During training the model generates multiple completions per page and the ones that pass more tests get higher reward; this prioritizes structural fidelity over vague approximations. (allenai.org)
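The "multiple completions per page" step is where GRPO gets its signal: each completion is scored relative to the others sampled for the same page. The sketch below shows only that group-relative normalization, in the form commonly described for GRPO (reward minus the group mean, divided by the group standard deviation). The function name, the `eps` constant, and the use of population standard deviation are assumptions, and the policy update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a page.

    Completions that beat the group average get a positive advantage,
    the rest a negative one. eps avoids division by zero when every
    completion earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical test-pass rewards for four completions of the same page.
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If all completions score the same, no completion is preferred.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

One consequence of this design is that pages where every completion passes, or every completion fails, contribute no learning signal, so the useful training pages are the ones where the model is still inconsistent.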
The important bit: the system learns to produce the correct structure, not just text that looks similar. That difference cuts down on typical errors in multi-stage pipelines that stitch together OCR, table detection and postprocessing.
Real-world performance
On Ai2's benchmark, olmOCR 2 scores 82.4 on olmOCR-Bench, with notable improvements where OCR usually fails: old mathematics, dense tables and multi-column pages. For example, tables jump from 72.9 to 84.9, and scans of old math improve from 79.9 to 82.3 in the reported evaluation. Those numbers reflect concrete gains for structured reading. (allenai.org)
A concrete example: historical texts that used to be misread because of handwriting are now correctly transcribed in certain documented cases. That matters for researchers, archivists and compliance teams who rely on trustworthy data. (allenai.org)
Speed, deployment and availability
Ai2 provides an FP8-quantized version of the model for efficient deployment. According to the announcement, it reaches about 3,400 output tokens per second on an H100 GPU, which Ai2 estimates is enough to process 10,000 pages for less than $2. They also published the weights on Hugging Face and offer a demo plus the olmOCR toolkit with fine-tuning scripts and production pipelines, so you can try the model or integrate it right away. (allenai.org)
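If you want to redo that cost estimate for your own workload, the arithmetic is simple. In the sketch below only the throughput (3,400 tokens per second) and the page count (10,000) come from the announcement. The tokens-per-page figure and the hourly GPU price are placeholder assumptions, not Ai2's numbers, so substitute your own.

```python
def gpu_cost_usd(pages, tokens_per_page, tokens_per_second, gpu_usd_per_hour):
    """Estimate GPU cost for transcribing a batch of pages."""
    total_tokens = pages * tokens_per_page
    seconds = total_tokens / tokens_per_second
    return seconds / 3600 * gpu_usd_per_hour


def max_tokens_per_page(budget_usd, pages, tokens_per_second, gpu_usd_per_hour):
    """Largest average page length (in output tokens) that fits a budget."""
    seconds = budget_usd / gpu_usd_per_hour * 3600
    return seconds * tokens_per_second / pages


# From the announcement: throughput and page count.
TOKENS_PER_SECOND = 3400
PAGES = 10_000

# Placeholder assumptions, not figures from Ai2.
ASSUMED_TOKENS_PER_PAGE = 1000
ASSUMED_GPU_USD_PER_HOUR = 2.00

cost = gpu_cost_usd(PAGES, ASSUMED_TOKENS_PER_PAGE,
                    TOKENS_PER_SECOND, ASSUMED_GPU_USD_PER_HOUR)
print(f"Estimated cost: ${cost:.2f}")

limit = max_tokens_per_page(2.00, PAGES,
                            TOKENS_PER_SECOND, ASSUMED_GPU_USD_PER_HOUR)
print(f"Tokens per page that fit a $2 budget: {limit:.0f}")
```

The second function is the more useful one in practice: given your GPU price, it tells you how long an average page can be before the batch exceeds the budget.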
What does this mean for you and your project?
If you work with large collections of PDFs (research, compliance, accessibility), this approach reduces the need for manual rule engineering. You can adapt the model with a few example pages and get better results without long postprocessing chains.
For startups and product teams, the combo of an open model, demo and toolkit speeds up experimentation. The model is available under an Apache 2.0 license and there are usage examples to get you started quickly. (huggingface.co)
For researchers, the method of turning verifiers into training signals is a practical example of aligning evaluation goals with training goals. It's an idea you can reuse in other domains where part of the output can be checked automatically.
Download the weights and the model card on Hugging Face if you want to run the model locally or on your own infrastructure. (huggingface.co)
Check out the olmOCR toolkit on GitHub for inference pipelines, page rendering and fine-tuning scripts.
Final reflection
olmOCR 2 isn't just another model that nudges a metric upward. It's a bet on training toward what really matters: verifiable, structured outputs that you can reliably plug into products and workflows. Want to put this to work on your documents? With a handful of examples and the open toolkit you can validate it for yourself.