DPO reduces degeneration in OCR: evidence beyond chatbots | Keryc
Publishing a new model sometimes reveals more than a bump in accuracy: it reveals why models fail. DharmaAI released DharmaOCR and, besides reporting better performance in structured OCR, applied Direct Preference Optimization (DPO) as a second training stage to tackle a persistent failure: text degeneration, the endless repetitions that turn a transcription into a loop.
What DharmaOCR found
The study used 23,726 documents and compared open-source and commercial model families on OCR for Brazilian Portuguese. Key result: the degeneration rate in models without fine-tuning ranged from under 1% to over 33%.
Supervised fine-tuning (SFT) lowered that rate in many cases, but rarely to production-ready levels.
That's where DPO comes in. Applied as a second stage on the same set, DPO reduced degeneration across all families tested. On average: a 59.4% relative reduction versus SFT. Observed range: 37% to 87.6%.
Concrete example: Nanonets-OCR2-3B went from 1.61% down to 0.20%. Even models that started with very high degeneration improved substantially.
A curious case: Qwen2.5-VL-3B had a 0.60% degeneration rate in its vanilla version, rose to 3.23% after SFT, then dropped to 1.41% with DPO. Why? SFT made the model capable of attempting the task with longer outputs, and only with those longer outputs did the geometry of degeneration show up.
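The relative reductions quoted above can be checked directly from the before and after rates. A minimal sketch in Python; the helper name `relative_reduction` is ours, and we assume the 1.61% figure for Nanonets-OCR2-3B is its post-SFT rate:

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent when a rate goes from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Degeneration rates (in %) after SFT and after DPO, as quoted in the article.
# Assumption: the 1.61% Nanonets figure is the post-SFT rate.
rates = {
    "Nanonets-OCR2-3B": (1.61, 0.20),
    "Qwen2.5-VL-3B": (3.23, 1.41),
}

for name, (sft, dpo) in rates.items():
    print(f"{name}: {relative_reduction(sft, dpo):.1f}% relative reduction vs SFT")
```

The Nanonets figure lands at the top of the reported range, and the Qwen figure sits close to the reported average.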
Why SFT isn't enough: the geometry of the failure
Have you wondered why fine-tuning with correct data doesn't erase certain failures? The technical explanation points to the granularity of the loss. SFT optimizes token by token: each next-token prediction is scored on its own, given the correct preceding text. A repetitive loop can be locally probable at every step, and a token-level loss has no sequence-level term to penalize it.
DPO flips that logic: it works with preferences over the complete sequence (chosen output vs rejected output). That lets you label a degenerated transcription as the wrong output in its entirety, not just as a sequence of low-probability tokens. Degeneration isn't just a decoder artifact: it's a property of the learned distribution and the "attractors" where probability concentrates during inference (see Holtzman et al., 2020).
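The contrast can be written down directly. As a sketch in standard notation (the symbols are ours, not the article's: $x$ is the input page, $y$ a transcription, $y_w$ and $y_l$ the chosen and rejected outputs, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ a frozen reference model such as the SFT checkpoint, $\beta$ a scaling hyperparameter, $\sigma$ the logistic function), the SFT loss is a sum of per-token terms, while the standard DPO loss compares whole-sequence log-probabilities:

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)}\left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right]

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Since $\pi(y \mid x) = \prod_t \pi(y_t \mid x, y_{<t})$, each DPO term scores a complete transcription, so a looping output placed in the $y_l$ slot is pushed down as a whole. Whether DharmaOCR used exactly this variant of the loss is not stated here.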
Interventions at inference time (repetition penalties, temperature, early aborts) treat the symptom. DPO attacks the shape of the distribution that generates that symptom.
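For contrast, here is what a typical inference-time fix looks like. This is a minimal sketch of a CTRL-style repetition penalty applied to raw logits in plain Python; the function name and the penalty value 1.3 are illustrative assumptions, not settings from the DharmaOCR study:

```python
def apply_repetition_penalty(logits: list[float], generated_ids: list[int],
                             penalty: float = 1.3) -> list[float]:
    """Lower the score of tokens that were already generated.

    Positive logits are divided by `penalty` and negative ones multiplied,
    so in both cases the score moves down when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Toy vocabulary of 4 tokens; tokens 0 and 2 were already emitted.
logits = [2.6, 1.0, -1.0, 0.5]
adjusted = apply_repetition_penalty(logits, generated_ids=[0, 2, 0])
print(adjusted)
```

The penalty only reshapes one decoding step at a time. The model's underlying distribution, and with it the tendency to fall into loops, stays exactly as it was.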
How they applied DPO in OCR: the technical pipeline
The practical idea was simple and elegant:
1. Generate multiple candidates per document using the SFT model.
2. Evaluate each candidate with an automatic judge (an LLM with task-specific criteria).
3. Build preference pairs (chosen, rejected). Important: outputs with degeneration were deliberately kept as rejected examples.
4. Train DPO on those pairs, explicitly penalizing degenerated outputs at the sequence level.
The automatic scoring applied four quality criteria to distinguish clean transcriptions from repetitive loops. Large-scale human annotation wasn't needed: the signal came from the model's own failures. DharmaAI describes this as "preference-guided implicit unlikelihood": you push the model toward good outputs and pull it away from a concrete class of failures.
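The pair-building step can be sketched as follows. This is an illustration under stated assumptions, not DharmaAI's code: `looks_degenerate` is a simple rule-based stand-in for the loop check (the study used an LLM judge with four criteria), the `score` field stands in for the judge's score, and every name and threshold here is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # judge score (hypothetical scale, higher is better)


def looks_degenerate(text: str, n: int = 4, max_share: float = 0.3) -> bool:
    """Flag outputs where one word n-gram accounts for too many of the n-grams."""
    words = text.split()
    if len(words) < 2 * n:
        return False  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    most_common_count = Counter(ngrams).most_common(1)[0][1]
    return most_common_count / len(ngrams) > max_share


def build_preference_pairs(prompt: str, candidates: list[Candidate]) -> list[dict]:
    """Pair the best clean candidate with every degenerate one as 'rejected'."""
    clean = [c for c in candidates if not looks_degenerate(c.text)]
    loops = [c for c in candidates if looks_degenerate(c.text)]
    if not clean or not loops:
        return []  # no usable contrast for this document
    chosen = max(clean, key=lambda c: c.score)
    return [{"prompt": prompt, "chosen": chosen.text, "rejected": r.text} for r in loops]


# Two toy candidates for one page: a clean transcription and a looping one.
candidates = [
    Candidate("Invoice no. 1042 issued on 12 March to ACME Ltda", 0.9),
    Candidate("Invoice no. 1042 " + "total total total total " * 20, 0.1),
]
pairs = build_preference_pairs("<page image 1>", candidates)
print(len(pairs), pairs[0]["rejected"][:40])
```

The sketch only pairs clean outputs against loops, which is a simplification: a real preference set would likely also contain rejected outputs that are merely lower quality. The prompt/chosen/rejected layout is a common convention for preference datasets, chosen here for familiarity.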
Structural conditions for this to work in other tasks
This isn't magic that applies everywhere. The method requires three conditions:
1. The failure mode must be categorical and recognizable (for example, repetition loops), not just low-quality noise.
2. There must be an automatic scoring mechanism sufficiently consistent to label chosen/rejected pairs without relying on human annotators.
3. There must be volume: enough inference outputs to build a preference dataset with variety.
If these conditions hold, using the model's own failures as negative signal is a real option.
Practical implications for engineers and ML teams
SFT is still necessary: it brings the model's distribution closer to the target domain. But it's not enough to fix certain structural failures.
Adding a DPO stage afterwards is a one-time investment that can reduce degeneration without sacrificing extraction quality.
In practice: generate N candidates per input, use a judge (rules + LLM) to detect loops and keep those examples as rejected.
Expect large but variable gains: in the benchmark, relative reductions in degeneration ranged from 37% to 87.6% depending on architecture and starting point.
Keep in mind the relation between capacity and exposure to failures: increasing capability can expose the model to new degeneration attractors; DPO and SFT address different dimensions of the problem.
The important thing: don't dismiss defective outputs as noise. If the failure is consistent and recognizable, that same failure is the best signal to teach the model not to repeat it.
Final reflection
DharmaOCR shows that DPO isn't just an alignment tool for chatbots: it's a useful engineering technique for mitigating a concrete class of failure in structured generation. The practical message is clear: when the error is categorical, frequent, and automatically evaluable, turn your garbage into useful training signal. Isn't it curious that the model tells us exactly where it hurts?