OCR post-processing engine

Clean Unicode for handwritten Gurmukhi & Indic scripts

OCR mangles connected scripts: the sihari lands before its consonant, nuktas drift, diacritics scatter. gurmukhifix repairs OCR output from Tesseract, Surya, Gemini or any other engine into well-formed Unicode. It targets Gurmukhi and other Indic scripts first (Urdu and Farsi support is experimental), and it never corrupts text that was already correct, including Gurbani.


0.00 corrected character error rate (CER) on 300 real SGGS lines

6 scripts, Gurmukhi/Indic first

0 silent corruptions, property-tested

Raw OCR: ਿਸੱਖ ਧਰਮ

↓ gurmukhifix

Unicode: ਸਿੱਖ ਧਰਮ

The sihari (ਿ) is moved back after its base consonant. A misplaced sihari is the #1 systematic Gurmukhi OCR error.

Interactive playground

Paste raw OCR text and watch it become clean, well-formed Unicode. Everything runs in your browser.

This is a lightweight in-browser preview covering Gurmukhi, Punjabi, Hindi and Devanagari. The installable package is authoritative: it adds Gurbani dictionary-gating, a verbatim-scripture lock, and Urdu/Farsi support.


How it works

gurmukhifix is a post-processor, not an OCR engine. Tesseract turns the image into characters; gurmukhifix applies the linguistic rules Tesseract can't.

Image → OCR

Run any engine — Tesseract (TSV/hOCR), Surya, Gemini, Google Vision. gurmukhifix reads them all.

→

Confidence routing

Text with OCR confidence ≥85% passes through unchanged, text below 60% is flagged, and the middle band (60% to 85%) is corrected.
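The three bands can be sketched in a few lines of Python. This is an illustration of the thresholds stated above, not the package's actual code: the `Route` enum, the function name and the parameter names are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    PASS_THROUGH = "pass_through"  # trusted, left untouched
    CORRECT = "correct"            # middle band, sent to the repair rules
    FLAG = "flag"                  # too uncertain, surfaced to the user


def route_by_confidence(conf: float, high: float = 85.0, low: float = 60.0) -> Route:
    """Map an OCR confidence (0-100) to one of the three bands.

    The 85 and 60 defaults come from the text above; everything else
    here is an assumption made for illustration.
    """
    if conf >= high:
        return Route.PASS_THROUGH
    if conf < low:
        return Route.FLAG
    return Route.CORRECT
```

Keeping the thresholds as parameters makes it easy to experiment with tighter or looser bands per OCR engine.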

→

Evidence-gated repair

A fix is applied only if it lowers the text's script-validity badness, so text that is already valid is never changed.
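A minimal sketch of the gating idea, assuming a toy badness score that only counts combining marks with no base letter in front of them. The real scoring is certainly richer; `badness` and `gated_apply` are hypothetical names, not the package's API.

```python
import unicodedata
from typing import Callable


def badness(text: str) -> int:
    """Toy score: number of combining marks that do not follow a letter or mark."""
    score = 0
    prev = ""
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        has_base = prev != "" and unicodedata.category(prev)[0] in ("L", "M")
        if is_mark and not has_base:
            score += 1
        prev = ch
    return score


def gated_apply(text: str, fix: Callable[[str], str]) -> str:
    """Keep the candidate only if it is strictly less bad than the input."""
    candidate = fix(text)
    return candidate if badness(candidate) < badness(text) else text
```

Because the comparison is strict, a fix that leaves the score unchanged is rejected, which is what keeps already-valid text untouched in this sketch.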

→

Clean Unicode

Corrected text, a per-fix report and preserved layout metadata.

Sihari reordering

The dependent vowel ਿ is written before its consonant but must be encoded after it. gurmukhifix moves it back.
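As a rough sketch, the reordering can be expressed as a single regular expression: a sihari that does not already follow a consonant is moved after the consonant that comes next. The code-point ranges are from the Unicode Gurmukhi block; the function name and the exact rule are assumptions for illustration, and the package's own rule is likely more careful.

```python
import re

SIHARI = "\u0A3F"                     # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"                      # GURMUKHI SIGN NUKTA
IRI = "\u0A72"                        # vowel carrier that can also take a sihari
CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5E"

# A sihari with no consonant (or carrier) before it, followed by a consonant
# and an optional nukta.
_MISPLACED = re.compile(
    f"(?<![{CONSONANT}{NUKTA}{IRI}]){SIHARI}([{CONSONANT}]{NUKTA}?)"
)


def reorder_sihari(text: str) -> str:
    """Move a leading sihari to just after its base consonant."""
    return _MISPLACED.sub(lambda m: m.group(1) + SIHARI, text)
```

The negative lookbehind is what makes the sketch idempotent: once the sihari sits after a consonant, the pattern no longer matches.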

Nukta canonicalisation

A nukta after a vowel sign (ਸਾ਼) is reordered to the canonical consonant+nukta+vowel (ਸ਼ਾ).
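Unicode normalisation does not perform this swap by itself, because Gurmukhi vowel signs are not reorderable combining marks, so it has to be done explicitly. A minimal sketch follows, assuming a single vowel sign between the consonant and the stray nukta; the function name is hypothetical.

```python
import re

NUKTA = "\u0A3C"
CONSONANT = "\u0A15-\u0A39"
# Dependent vowel signs in the Gurmukhi block.
VOWEL_SIGN = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"

_NUKTA_AFTER_VOWEL = re.compile(f"([{CONSONANT}])([{VOWEL_SIGN}]){NUKTA}")


def canonicalise_nukta(text: str) -> str:
    """Rewrite consonant + vowel sign + nukta as consonant + nukta + vowel sign."""
    return _NUKTA_AFTER_VOWEL.sub(
        lambda m: m.group(1) + NUKTA + m.group(2), text
    )
```

Text that is already in consonant+nukta+vowel order does not match the pattern and is returned unchanged.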

Never corrupts good text

Corrections require validity evidence. Already-correct Unicode round-trips byte-for-byte — enforced by CI.

Validity report

Orphaned matras, impossible sequences and out-of-script code-points are surfaced with severity.
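To make the idea concrete, here is a small sketch of such a report covering two of the checks: orphaned matras and out-of-script code points. The `Issue` record, the issue names and the severity labels are assumptions for illustration, not the package's actual report format.

```python
import unicodedata
from dataclasses import dataclass
from typing import List


@dataclass
class Issue:
    index: int      # position of the offending character
    char: str
    kind: str       # "orphaned_matra" or "out_of_script" (hypothetical labels)
    severity: str   # "error" or "warning" (hypothetical labels)


def validity_report(text: str) -> List[Issue]:
    """List letters/marks outside the Gurmukhi block and marks with no base."""
    issues: List[Issue] = []
    prev = ""
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        in_gurmukhi = "\u0A00" <= ch <= "\u0A7F"
        if cat[0] in ("L", "M") and not in_gurmukhi:
            issues.append(Issue(i, ch, "out_of_script", "warning"))
        is_mark = cat in ("Mn", "Mc")
        has_base = prev != "" and unicodedata.category(prev)[0] in ("L", "M")
        if is_mark and not has_base:
            issues.append(Issue(i, ch, "orphaned_matra", "error"))
        prev = ch
    return issues
```

Spaces, digits and punctuation are deliberately ignored here, since they are shared across scripts.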

Batch + learning

Parallel batch processing and a SQLite store that promotes repeatedly-confirmed corrections.
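One way such a store could work is sketched below with Python's built-in `sqlite3` module. The table layout, the function names and the promotion threshold of three confirmations are all assumptions made for illustration; the package's real schema is not described here.

```python
import sqlite3
from typing import Dict

PROMOTE_AFTER = 3  # hypothetical threshold


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the corrections store."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS corrections (
               raw TEXT NOT NULL,
               fixed TEXT NOT NULL,
               confirmations INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (raw, fixed)
           )"""
    )
    return db


def confirm(db: sqlite3.Connection, raw: str, fixed: str) -> None:
    """Record one more confirmation that `raw` should become `fixed`."""
    db.execute(
        "INSERT OR IGNORE INTO corrections (raw, fixed) VALUES (?, ?)",
        (raw, fixed),
    )
    db.execute(
        "UPDATE corrections SET confirmations = confirmations + 1 "
        "WHERE raw = ? AND fixed = ?",
        (raw, fixed),
    )
    db.commit()


def promoted(db: sqlite3.Connection) -> Dict[str, str]:
    """Return corrections confirmed often enough to be applied automatically."""
    rows = db.execute(
        "SELECT raw, fixed FROM corrections WHERE confirmations >= ? ORDER BY raw",
        (PROMOTE_AFTER,),
    )
    return dict(rows.fetchall())
```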

Layout preserved

Bounding boxes flow through end-to-end so downstream tools can rebuild the page.
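For Tesseract's TSV output, the bounding box and confidence of each word are already columns in the file (`left`, `top`, `width`, `height`, `conf`, `text`), so carrying them along is mostly a matter of not throwing them away. The sketch below shows one way to read them; the `Word` record and the function name are hypothetical, not the package's API.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    conf: float
    left: int
    top: int
    width: int
    height: int


def read_tesseract_tsv(tsv_text: str) -> List[Word]:
    """Parse Tesseract TSV and keep word-level rows with their boxes."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words: List[Word] = []
    for row in reader:
        if row.get("level") != "5":      # level 5 rows are individual words
            continue
        text = row.get("text") or ""
        if not text.strip():
            continue
        words.append(
            Word(
                text=text,
                conf=float(row["conf"]),
                left=int(row["left"]),
                top=int(row["top"]),
                width=int(row["width"]),
                height=int(row["height"]),
            )
        )
    return words
```

A corrector that rewrites only `Word.text` and leaves the other fields alone hands downstream tools everything they need to place the corrected words back on the page.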

Larivaar & padched — where words begin

Gurbani was originally written larivaar, as one unbroken stream of letters, and was later written padched, with a space between words. Deciding where words begin is exactly what OCR gets wrong. gurmukhifix reasons about word boundaries, gated against a verbatim Gurbani lexicon so that a real scripture word is never split or rewritten.
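A lexicon-gated word break can be illustrated with a classic dynamic-programming segmenter: the stream is split only if every resulting piece is a lexicon word, and is otherwise left alone. This is a generic technique shown as a sketch, not a description of how gurmukhifix actually decides. The function name is hypothetical and the lexicon is whatever set of verbatim words the caller supplies.

```python
from typing import List, Optional, Set


def segment(stream: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split `stream` into lexicon words, preferring the fewest words.

    Returns None when no complete segmentation exists, so the caller can
    leave the original text untouched instead of guessing.
    """
    n = len(stream)
    max_len = max(map(len, lexicon), default=0)
    # best[i] holds the best segmentation of stream[:i], or None if impossible.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = stream[j:i]
            if best[j] is not None and piece in lexicon:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]
```

Returning `None` rather than a partial split is the gating step in this sketch: text that cannot be fully explained by the lexicon is never broken up.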

Figure: The Mool Mantar written larivaar, a continuous stream of Gurmukhi with no spaces between words.

Figure: The same Mool Mantar written padched, with a space between every word.

Six scripts, one pipeline

One shared engine, per-script rules via extends. Gurmukhi, Punjabi, Hindi and Devanagari run in the demo above; Urdu and Farsi ship in the package as experimental (structural-only).


Get started

gurmukhifix is on PyPI, MIT-licensed and free for anyone to use. It reads output from any OCR engine, such as Tesseract, Surya, Gemini or Google Vision.

Install

pip install gurmukhifix

Run Tesseract → gurmukhifix

tesseract page.png out --oem 1 --psm 6 tsv
gurmukhifix correct --input out.tsv \
  --lang gurmukhi --output ./results

Batch a folder

gurmukhifix batch --input-dir ./pages \
  --lang devanagari --workers 4
