
Field notes

Why we built gurmukhifix: 600 years of manuscripts vs. one stubborn OCR problem

It started with a pile of scans. Thousands of them: handwritten manuscripts, ledgers and printed pages from north-west India spanning roughly the 1400s to the present day. Gurmukhi scripture and Punjabi correspondence. Persian and Urdu administrative records. Hindi and Devanagari texts. The goal was simple to state and hard to do — turn the images into searchable, reusable digital text.

The promise and the wall

Modern OCR feels like magic on clean printed English. So we pointed Tesseract at the collection and waited. What came back looked, at a glance, like text. But when we tried to search it, index it, or paste it into a document, it fell apart.

For handwritten Gurmukhi and Urdu, character error rates routinely reached 30–40% or more. Worse, the errors weren't random noise: they were systematic, and they produced Unicode that was subtly, invisibly broken.

The bug you can't see

Take one example that haunted the Gurmukhi pages. The vowel sign sihari (ਿ, U+0A3F) is written to the left of the consonant it belongs to, but the Unicode standard stores text in logical order, so the sihari must be encoded after that consonant. Tesseract, faithfully, writes down what it sees: the sihari first. The result renders almost correctly on screen, so it passes the eye test. Then you search for the word and get nothing, because at the code-point level it is a different, ill-formed sequence.
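To make the invisible bug concrete, here is a minimal Python sketch using the word ਦਿਨ (din, "day"). The function name `reorder_initial_sihari` and the rough consonant range are illustrative assumptions, not gurmukhifix's actual API, and the sketch only handles the unambiguous case: a sihari with no consonant in front of it.

```python
import re

# Assumption: a rough Gurmukhi consonant class (KA..HA plus the extended
# letters). A production rule set would likely be more careful than this.
_CONS = "\u0A15-\u0A39\u0A59-\u0A5C\u0A5E"

# A sihari (U+0A3F) that is NOT preceded by a consonant or nukta (U+0A3C),
# and IS followed by a consonant with an optional nukta.
STRAY_SIHARI = re.compile(f"(?<![{_CONS}\u0A3C])\u0A3F([{_CONS}]\u0A3C?)")


def reorder_initial_sihari(text: str) -> str:
    """Move a stray leading sihari to after the consonant it belongs to."""
    return STRAY_SIHARI.sub(lambda m: m.group(1) + "\u0A3F", text)


logical = "\u0A26\u0A3F\u0A28"  # DA + sihari + NA: the order Unicode expects
visual = "\u0A3F\u0A26\u0A28"   # sihari + DA + NA: the order the eye sees

print(logical in visual)                          # False: the search misses
print(reorder_initial_sihari(visual) == logical)  # True: the reorder repairs it
print(reorder_initial_sihari(logical) == logical) # True: correct text is untouched
```

A sihari sitting between two consonants is deliberately left alone here: that sequence is also valid logical order, so there is no evidence it needs moving.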

Persian and Urdu had their own version of this: a misread or dropped nukta (dot) confusing ب with پ, or the Persian ی (U+06CC) silently encoded as the Arabic ي (U+064A). Hindi had matras detached from their consonants. Every script had a handful of these: predictable, rule-shaped failures that no amount of re-running Tesseract would fix, because Tesseract has no idea what the script's rules are. Its job is pixels to characters. It does not know that a dependent vowel cannot begin a word.
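The yeh confusion can be shown in a few lines. This is a sketch under one loud assumption: the text is already known to be Urdu or Persian, so every Arabic yeh (U+064A) can safely become Farsi yeh (U+06CC). The helper name `canonicalise_yeh` is hypothetical.

```python
import unicodedata

ARABIC_YEH = "\u064A"
FARSI_YEH = "\u06CC"

# Assumption: the input is known to be Urdu or Persian, where U+06CC is the
# expected letter. This mapping would be wrong for Arabic-language text.
YEH_FIX = str.maketrans({ARABIC_YEH: FARSI_YEH})


def canonicalise_yeh(text: str) -> str:
    """Replace Arabic yeh with Farsi yeh; leave everything else alone."""
    return text.translate(YEH_FIX)


ocr_word = "\u0627\u064A\u06A9"  # alef + Arabic yeh + keheh (as OCR emitted it)
expected = "\u0627\u06CC\u06A9"  # alef + Farsi yeh + keheh: Urdu "ek" (one)

for ch in (ARABIC_YEH, FARSI_YEH):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Unicode normalisation treats these as distinct letters, so NFC alone
# does not repair the word; a script-aware rule has to.
print(unicodedata.normalize("NFC", ocr_word) == expected)  # False
print(canonicalise_yeh(ocr_word) == expected)              # True
```

In the middle of a word the two letters look the same, which is why the mismatch survives a visual check and only surfaces when a search fails.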

Larivaar, padched, and where words begin

There is a second, deeper version of the same problem, and it is unique to Gurbani. Historically the Guru Granth Sahib was written larivaar — one unbroken stream of letters, with no spaces between words at all. Later padched ("word-split") saroops added the spaces to aid reading. OCR of either has to decide where one word ends and the next begins, and it gets that wrong constantly: fusing words that belong apart, or splitting one mid-cluster.

[Figure: the Mool Mantar written larivaar (a continuous stream of Gurmukhi with no word breaks) and the same text written padched (a space between every word).]
The same Mool Mantar. Word boundaries are precisely what OCR — and any search index built on it — has to get right, and precisely what gurmukhifix reasons about, gated against a verbatim Gurbani lexicon so a real scripture word is never split or rewritten.
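One way to picture lexicon-gated word splitting is the sketch below. It is not gurmukhifix's algorithm, just an illustration of the idea: split a larivaar stream only when the lexicon supports exactly one segmentation, and otherwise decline to guess. The function name `segment`, the `max_len` parameter and the four-word toy lexicon are all assumptions.

```python
from functools import lru_cache


def segment(stream, lexicon, max_len=12):
    """Return the unique lexicon-backed split of `stream`, or None.

    None means "no evidence" (some part is not in the lexicon) or
    "ambiguous" (more than one split fits), so the caller can flag it.
    """

    @lru_cache(maxsize=None)
    def splits(start):
        # Returns up to two segmentations of stream[start:], as tuples.
        if start == len(stream):
            return ((),)
        found = []
        for end in range(start + 1, min(len(stream), start + max_len) + 1):
            word = stream[start:end]
            if word in lexicon:
                for rest in splits(end):
                    found.append((word,) + rest)
                    if len(found) == 2:  # two is enough to know it's ambiguous
                        return tuple(found)
        return tuple(found)

    result = splits(0)
    return list(result[0]) if len(result) == 1 else None


# Toy lexicon: four words from the opening of the Mool Mantar.
words = ["ਸਤਿ", "ਨਾਮੁ", "ਕਰਤਾ", "ਪੁਰਖੁ"]
lexicon = set(words)

larivaar = "".join(words)          # no spaces at all
print(segment(larivaar, lexicon))  # the four words, in order
print(segment(larivaar + "ਕ", lexicon))  # None: trailing letter has no support
```

Returning None instead of a best guess is the point of the design: a wrong word boundary in scripture is worse than an unsplit line waiting for review.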

Why the obvious fixes didn't work

The tempting first move is a find-and-replace table: "whenever you see X, write Y." We tried versions of that. It was a disaster. A blind substitution rewrites the letters that were already correct, and on a corpus that is mostly correct, that means you corrupt far more than you fix. An early naive pass actually made the text worse than raw Tesseract — by a lot.
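The arithmetic behind that failure is easy to reproduce on a toy corpus. The sketch below is illustrative only: the four Urdu words and the single rule "always turn beh into peh" are assumptions chosen to show the effect, not data from the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def corpus_cer(hyps, refs) -> float:
    """Character error rate: total edits divided by total reference length."""
    edits = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return edits / sum(len(r) for r in refs)


BEH, PEH = "\u0628", "\u067E"

kitab = "\u06A9\u062A\u0627\u0628"  # "book"  (ends in beh)
baat = "\u0628\u0627\u062A"         # "talk"  (starts with beh)
sab = "\u0633\u0628"                # "all"   (ends in beh)
pani = "\u067E\u0627\u0646\u06CC"   # "water" (starts with peh)

truth = [kitab, baat, sab, pani]
ocr = [kitab, baat, sab, pani.replace(PEH, BEH)]   # one real error: peh read as beh
blind = [w.replace(BEH, PEH) for w in ocr]         # blind rule: every beh becomes peh

print(f"raw OCR CER:     {corpus_cer(ocr, truth):.3f}")
print(f"after blind fix: {corpus_cer(blind, truth):.3f}")
```

The blind rule repairs the one broken word and breaks the three that were already right, so the error rate triples. On a corpus that is mostly correct, that ratio is the norm.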

The other tempting move is "just train a better model." That helps the recognition step, but it's expensive, needs labelled handwriting data we didn't have, and still leaves the structural Unicode problems untouched.

The idea: correct only with evidence

What finally worked was a different framing. Don't guess. Only change a character when there's evidence that the change makes the text more linguistically valid. If a word is already well-formed, leave it completely alone. If a sihari is sitting where a vowel sign can't legally sit, reorder it — because that move provably resolves a violation. If two letters are genuinely ambiguous and there's no signal which is right, don't flip a coin; flag it for a human.
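The decision rule can be sketched in a few lines of Python. Everything named here (`Decision`, `correct`, `is_well_formed`, the single validity rule and the toy lexicon) is a hypothetical illustration of the principle, not the library's real interface.

```python
from dataclasses import dataclass

# Gurmukhi dependent vowel signs (matras).
DEPENDENT_SIGNS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def is_well_formed(word: str) -> bool:
    """Toy validity rule: a dependent vowel sign cannot begin a word."""
    return bool(word) and word[0] not in DEPENDENT_SIGNS


@dataclass
class Decision:
    text: str
    action: str  # "keep", "fix" or "flag"


def correct(word, candidates, lexicon):
    # An attested word is never touched.
    if word in lexicon:
        return Decision(word, "keep")
    # A candidate counts only if it carries positive evidence:
    # a dictionary hit, or a validity gain over the original.
    supported = [c for c in candidates
                 if c in lexicon or (is_well_formed(c) and not is_well_formed(word))]
    if len(supported) == 1:
        return Decision(supported[0], "fix")
    if len(supported) > 1:
        return Decision(word, "flag")  # competing evidence: ask a human
    # No evidence at all: keep valid text, flag text that is still broken.
    return Decision(word, "keep" if is_well_formed(word) else "flag")


din = "\u0A26\u0A3F\u0A28"          # logical order
din_visual = "\u0A3F\u0A26\u0A28"   # sihari first, as OCR emitted it
toy_lexicon = {din, "\u0A38\u0A30", "\u0A2E\u0A30"}

print(correct(din_visual, [din], toy_lexicon))
print(correct(din, [din_visual], toy_lexicon))
print(correct("\u0A2F\u0A30", ["\u0A38\u0A30", "\u0A2E\u0A30"], toy_lexicon))
```

The three calls exercise the three outcomes: a fix backed by evidence, a correct word left alone, and an ambiguous pair flagged rather than guessed.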

That principle, evidence-gated correction, became gurmukhifix. It sits after OCR and works with any engine (Tesseract, Surya, Gemini, Google Vision), applying the rules the recognizer can't: reorder the sihari, canonicalise the nukta, normalise to clean NFC Unicode, and flag anything it isn't sure about. Two things make it safe rather than merely clever. First, a 67,000-word Gurbani lexicon locks verbatim scripture, so a real Gurbani word is never split or rewritten. Second, every substitution must carry positive evidence, either a validity gain or a dictionary hit, so a blind guess between two valid letters is refused, not taken. That no-corruption guarantee is property-tested across every script and the entire Guru Granth Sahib in continuous integration.
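The NFC step is the easiest of these to demonstrate with the standard library alone. The sketch below uses ਸ਼ਬਦ (shabad), whose first letter can be encoded either as the single code point U+0A36 or as SA plus nukta (U+0A38 U+0A3C). The helper name `canonical` is an assumption for illustration.

```python
import unicodedata

precomposed = "\u0A36\u0A2C\u0A26"       # SHA as one code point + BA + DA
decomposed = "\u0A38\u0A3C\u0A2C\u0A26"  # SA + nukta + BA + DA


def canonical(text: str) -> str:
    """Normalise to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


print(precomposed == decomposed)                        # False: raw strings differ
print(canonical(precomposed) == canonical(decomposed))  # True: NFC unifies them

# U+0A36 is a Unicode composition exclusion, so NFC settles on the
# two-code-point form (SA + nukta) rather than the single code point.
print([f"U+{ord(c):04X}" for c in canonical(precomposed)])
```

Normalising both the corpus and the search query this way is what lets two visually identical spellings match each other.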

Why all these scripts, together

The archives of north-west India don't come neatly sorted by script. A single shelf might hold Gurmukhi scripture, a Persian land record and an Urdu letter. Existing post-processing tools, where they existed at all, covered one script in isolation. We needed one pipeline that understood Gurmukhi, Punjabi, Hindi, Devanagari, Urdu and Farsi — sharing an engine, differing only in their rules. That's what gurmukhifix is.

The result

On 300 real lines of Sri Guru Granth Sahib Ji with the most common OCR error injected, gurmukhifix drives character error rate to zero — and corrupts zero clean lines. It ships with 400+ tests, an honest reproducible benchmark, and a browser demo that will take an image or a whole PDF, OCR it, and clean the result — with nothing ever leaving your machine. The text that comes out is canonical Unicode you can actually search, index and trust.

It is not an OCR engine, and it can't recover handwriting the recognizer fundamentally couldn't read. It is the missing layer between "the OCR ran" and "the text is usable." For anyone trying to make six centuries of a region's writing searchable, that layer turned out to be the whole game.

gurmukhifix is free and open source under the MIT licence. Try the live demo or pip install gurmukhifix — it's on PyPI.