OCR post-processing engine

Clean Unicode for handwritten Gurmukhi & Indic scripts

OCR mangles connected scripts: sihari lands before its consonant, nuktas drift, diacritics scatter. gurmukhifix repairs OCR output from Tesseract, Surya, Gemini or any other engine into well-formed Unicode. It targets Gurmukhi and other Indic scripts first (Urdu and Farsi support is experimental), and it never corrupts text that was already correct, including Gurbani.

0.00 corrected CER on 300 real SGGS lines
6 scripts, Gurmukhi/Indic first
0 silent corruptions, property-tested
Raw OCR: ਿਸੱਖ ਧਰਮ
↓ gurmukhifix
Unicode: ਸਿੱਖ ਧਰਮ

Sihari (ਿ) is reordered after its base consonant. A misplaced sihari is the #1 systematic Gurmukhi OCR error.

Interactive playground

Paste raw OCR text and watch it become clean, well-formed Unicode. Everything runs in your browser.

This is a lightweight in-browser preview covering Gurmukhi, Punjabi, Hindi and Devanagari. The installable package is authoritative: it adds Gurbani dictionary-gating, a verbatim-scripture lock, and Urdu/Farsi support. Install it with pip install gurmukhifix.

How it works

gurmukhifix is a post-processor, not an OCR engine. Tesseract turns the image into characters; gurmukhifix applies the linguistic rules Tesseract can't.

1. Image → OCR

Run any engine: Tesseract (TSV/hOCR), Surya, Gemini or Google Vision. gurmukhifix reads them all.

2. Confidence routing

Text with OCR confidence of 85% or higher passes through unchanged, text below 60% is flagged, and the middle band is corrected.

3. Evidence-gated repair

A fix is applied only if it lowers script-validity badness, so correct text is never changed.

4. Clean Unicode

Corrected text, a per-fix report and preserved layout metadata.
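
Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the names route, badness and gated_fix are hypothetical, the thresholds are the ones quoted above, and the badness score is assumed here to be nothing more than a count of Gurmukhi vowel signs that have no base letter in front of them.

```python
import re

PASS_THRESHOLD = 85.0  # at or above this confidence, text passes through
FLAG_THRESHOLD = 60.0  # below this confidence, text is flagged

# Gurmukhi dependent vowel signs (toy subset chosen for this sketch).
_MATRA = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"
# Characters a vowel sign may follow: consonants, the vowel bearers, and nukta.
_BASE = "\u0A05\u0A15-\u0A39\u0A59-\u0A5C\u0A5E\u0A72\u0A73\u0A3C"
# A vowel sign that is NOT preceded by a valid base (including at string start).
_ORPHAN = re.compile(f"(?<![{_BASE}])[{_MATRA}]")


def route(confidence: float) -> str:
    """Map an OCR confidence (0-100) to 'pass', 'flag' or 'correct'."""
    if confidence >= PASS_THRESHOLD:
        return "pass"
    if confidence < FLAG_THRESHOLD:
        return "flag"
    return "correct"


def badness(text: str) -> int:
    """Toy validity score: the number of orphaned vowel signs in the text."""
    return len(_ORPHAN.findall(text))


def gated_fix(original: str, candidate: str) -> str:
    """Accept the candidate only if it strictly lowers the badness score."""
    return candidate if badness(candidate) < badness(original) else original
```

Because the gate demands a strict improvement, text that already scores zero can never be replaced, which is one simple way to get the "correct text is never changed" behaviour described in step 3.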

Sihari reordering

The dependent vowel ਿ is written before its consonant but must be encoded after it. gurmukhifix moves it back.
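
As a rough illustration of the idea, the move can be expressed as a single regular-expression substitution. This is a sketch under stated assumptions, not gurmukhifix's actual rule: reorder_sihari is a hypothetical name, and a sihari is treated as misplaced only when no consonant (or consonant plus nukta) sits immediately before it.

```python
import re

SIHARI = "\u0A3F"  # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"   # GURMUKHI SIGN NUKTA
# Gurmukhi consonant ranges used by this sketch.
_CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5C\u0A5E"

# A sihari with no consonant (or consonant + nukta) before it,
# followed by a consonant and an optional nukta.
_MISPLACED = re.compile(
    f"(?<![{_CONSONANT}{NUKTA}]){SIHARI}([{_CONSONANT}]{NUKTA}?)"
)


def reorder_sihari(text: str) -> str:
    """Move a leading sihari to just after the consonant it belongs to."""
    return _MISPLACED.sub(lambda m: m.group(1) + SIHARI, text)


# The page's own example: sihari + sa + addak + kha, then a second word.
raw = "\u0A3F\u0A38\u0A71\u0A16 \u0A27\u0A30\u0A2E"
fixed = reorder_sihari(raw)
```

A sihari that already follows its consonant fails the lookbehind and is left alone, so running the function on well-formed text returns it unchanged.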

Nukta canonicalisation

A nukta after a vowel sign (ਸਾ਼) is reordered to the canonical consonant+nukta+vowel (ਸ਼ਾ).
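
The reordering above can be sketched as follows. The function name canonicalise_nukta and the exact character ranges are assumptions made for this example rather than the package's API.

```python
import re

NUKTA = "\u0A3C"  # GURMUKHI SIGN NUKTA
_CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5C\u0A5E"
# Dependent vowel signs that a nukta may have drifted behind.
_VOWEL_SIGN = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"

# consonant + vowel sign + nukta (the drifted, non-canonical order)
_DRIFTED = re.compile(f"([{_CONSONANT}])([{_VOWEL_SIGN}]){NUKTA}")


def canonicalise_nukta(text: str) -> str:
    """Rewrite consonant+vowel+nukta as consonant+nukta+vowel."""
    return _DRIFTED.sub(lambda m: m.group(1) + NUKTA + m.group(2), text)


drifted = "\u0A38\u0A3E\u0A3C"    # sa, aa, nukta
canonical = "\u0A38\u0A3C\u0A3E"  # sa, nukta, aa
```

Unicode normalisation alone is not expected to perform this swap, because the vowel sign has combining class 0 and therefore blocks canonical reordering of the nukta; that is why an explicit rule is sketched here.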

Never corrupts good text

Corrections require validity evidence. Already-correct Unicode round-trips byte-for-byte — enforced by CI.

Validity report

Orphaned matras, impossible sequences and out-of-script code-points are surfaced with severity.
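
To make the idea concrete, here is a small sketch of what such a report could look like for Gurmukhi. The Issue record, the issue kinds and the severity labels are all hypothetical, chosen for illustration; the real package's report format is not shown on this page.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Issue:
    index: int     # position of the offending character
    char: str
    kind: str      # hypothetical label, e.g. "orphaned-matra"
    severity: str  # hypothetical label: "error" or "warning"


GURMUKHI_BLOCK = range(0x0A00, 0x0A80)
MATRAS = set(range(0x0A3E, 0x0A43)) | {0x0A47, 0x0A48, 0x0A4B, 0x0A4C}
# Consonants plus the vowel bearers that can carry a vowel sign.
BASES = (set(range(0x0A15, 0x0A3A)) | set(range(0x0A59, 0x0A5D))
         | {0x0A5E, 0x0A05, 0x0A72, 0x0A73})
NUKTA = 0x0A3C


def validity_report(text: str) -> list:
    """List orphaned vowel signs and letters from outside the Gurmukhi block."""
    issues = []
    prev = None
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in MATRAS and prev not in BASES and prev != NUKTA:
            # Also catches one impossible sequence: two vowel signs in a row.
            issues.append(Issue(i, ch, "orphaned-matra", "error"))
        elif cp not in GURMUKHI_BLOCK and unicodedata.category(ch).startswith("L"):
            issues.append(Issue(i, ch, "out-of-script", "warning"))
        prev = cp
    return issues
```

Spaces, digits and punctuation are not letters, so they are not reported as out-of-script in this sketch.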

Batch + learning

Parallel batch processing and a SQLite store that promotes repeatedly-confirmed corrections.
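
The learning half of this can be sketched with Python's built-in sqlite3 module. The schema, the function names and the promotion threshold of three confirmations are assumptions made for this example; the package's actual store may look quite different.

```python
import sqlite3

PROMOTE_AFTER = 3  # hypothetical threshold: confirmations needed to promote


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store that counts confirmed corrections."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corrections ("
        " raw TEXT NOT NULL,"
        " fixed TEXT NOT NULL,"
        " confirmations INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (raw, fixed))"
    )
    return conn


def confirm(conn: sqlite3.Connection, raw: str, fixed: str) -> None:
    """Record one more confirmation that `raw` should become `fixed`."""
    with conn:  # commit both statements together
        conn.execute(
            "INSERT OR IGNORE INTO corrections (raw, fixed) VALUES (?, ?)",
            (raw, fixed),
        )
        conn.execute(
            "UPDATE corrections SET confirmations = confirmations + 1"
            " WHERE raw = ? AND fixed = ?",
            (raw, fixed),
        )


def promoted(conn: sqlite3.Connection) -> dict:
    """Return raw -> fixed for corrections confirmed often enough."""
    rows = conn.execute(
        "SELECT raw, fixed FROM corrections WHERE confirmations >= ?"
        " ORDER BY raw",
        (PROMOTE_AFTER,),
    )
    return dict(rows.fetchall())
```

Keeping a count per (raw, fixed) pair means a one-off correction never becomes a rule on its own; only a correction that is confirmed repeatedly crosses the threshold.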

Layout preserved

Bounding boxes flow through end-to-end so downstream tools can rebuild the page.
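
For Tesseract's TSV output, carrying layout through a correction step can be as simple as replacing the text field and leaving the box alone. The sketch below relies on the standard Tesseract TSV columns (level, left, top, width, height, conf, text); the Word record and the read_words and correct_words helpers are hypothetical names, not part of gurmukhifix.

```python
import csv
import io
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    text: str
    conf: float
    left: int
    top: int
    width: int
    height: int


def read_words(tsv_text: str) -> list:
    """Parse Tesseract TSV and keep only word-level rows (level 5)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if row["level"] != "5":
            continue  # skip page, block, paragraph and line rows
        words.append(Word(
            text=row["text"],
            conf=float(row["conf"]),
            left=int(row["left"]),
            top=int(row["top"]),
            width=int(row["width"]),
            height=int(row["height"]),
        ))
    return words


def correct_words(words: list, fix) -> list:
    """Apply `fix` to each word's text; bounding boxes are copied untouched."""
    return [replace(w, text=fix(w.text)) for w in words]
```

Any text-level repair function can be passed as fix, and a downstream tool can still place each corrected word at its original coordinates.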

Larivaar & padched — where words begin

Gurbani was written larivaar, as one unbroken stream of letters, and later padched, with a space between each word. Deciding where words begin is exactly what OCR gets wrong. gurmukhifix reasons about word boundaries and gates every decision against a verbatim Gurbani lexicon, so a real scripture word is never split or rewritten.

Figure: the Mool Mantar written larivaar, a continuous stream of Gurmukhi with no word breaks.
Figure: the same Mool Mantar written padched, with a space between every word.
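
One way to picture lexicon-gated word breaking is a small dynamic-programming segmenter that only accepts splits made entirely of known words. This is an illustrative sketch, not the algorithm gurmukhifix uses: the segment function is hypothetical, and the four-word lexicon (the opening words of the Mool Mantar after ੴ) stands in for a full verbatim Gurbani lexicon.

```python
from typing import List, Optional, Set


def segment(stream: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split an unspaced stream into lexicon words, or return None
    when no split made entirely of lexicon words exists."""
    n = len(stream)
    max_len = max(map(len, lexicon), default=0)
    # best[i] holds one segmentation of stream[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = stream[start:end]
            if best[start] is not None and piece in lexicon:
                best[end] = best[start] + [piece]
                break
    return best[n]


# Tiny stand-in lexicon for the example.
words = ["ਸਤਿ", "ਨਾਮੁ", "ਕਰਤਾ", "ਪੁਰਖੁ"]
larivaar = "".join(words)                    # unbroken stream
padched = segment(larivaar, set(words))      # list of words, or None
```

Returning None when no complete split exists, instead of guessing, mirrors the gating idea: a stream is only broken where every resulting piece is a word the lexicon already knows.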

Six scripts, one pipeline

One shared engine, with per-script rules added via extends. Gurmukhi, Punjabi, Hindi and Devanagari run in the demo above; Urdu and Farsi ship in the package as experimental (structural-only).

Gurmukhi (in the demo)

The script of Sikh scripture and one of the writing systems for Punjabi.

Punjabi (in the demo)

Punjabi written in the Gurmukhi script; it builds on every Gurmukhi rule.

Hindi (in the demo)

Hindi written in the Devanagari script.

Devanagari (in the demo)

The shared base script behind Hindi, Marathi, Nepali and Sanskrit.

Urdu (experimental)

Urdu in the Nasta'liq style, a connected, right-to-left script.

Farsi (experimental)

Persian (Farsi), written in Arabic script with Persian-specific letters.

Get started

gurmukhifix is on PyPI, MIT-licensed and free for anyone to use. It reads output from any OCR engine, including Tesseract, Surya, Gemini and Google Vision.

Install
pip install gurmukhifix
Run Tesseract → gurmukhifix
tesseract page.png out --oem 1 --psm 6 tsv
gurmukhifix correct --input out.tsv \
  --lang gurmukhi --output ./results
Batch a folder
gurmukhifix batch --input-dir ./pages \
  --lang devanagari --workers 4