Multi-language OCR with optional LLM post-correction
Heratio runs Tesseract over every TIFF / JP2 / PDF page you ingest. Phase 4 adds:
- Multi-language packs. Each information object can be OCR'd against one or more Tesseract language packs at once (English + Afrikaans is the default for SA collections; Zulu, Xhosa, Sotho, Tswana, Tsonga, Venda, Ndebele, Swati, Pedi, Shona, Portuguese, French, Dutch, German, Spanish, Italian, and Latin are all wired in).
- LLM post-correction (opt-in). A second pass over the raw Tesseract
text fixes typical OCR confusions (
rn->m,O<->0,l<->1,cl<->d, broken hyphenation) without paraphrasing. Every edit is recorded so a human can audit it. - PREMIS events. Every OCR run and every LLM correction emits a
preservation_eventrow that surfaces in your PREMIS XML export alongside virus checks, fixity, and format identification.
1. Check what language packs the server has
Run this once after installing Tesseract:
php artisan ahg:tesseract:list-languages
You will see something like:
Tesseract version: tesseract 5.3.4
Installed language packs (12):
afr deu eng fra ita lat
nld osd por spa zul xho
The result is cached into ahg_ai_settings.ocr_languages so the rest
of the application can render dropdowns and validate language specs.
Missing a pack you need? On Debian / Ubuntu:
sudo apt install tesseract-ocr-afr tesseract-ocr-zul tesseract-ocr-xho
Then re-run ahg:tesseract:list-languages to refresh the cache.
2. Pick the default language spec
Settings live under ahg_ai_settings with feature = 'ocr':
| Key | Default | Meaning |
|---|---|---|
ocr_default_languages |
osd+eng+afr |
Tesseract +-joined spec used when no per-IO language is detected. osd adds orientation + script detection. |
ocr_default_psm |
3 |
Page-segmentation mode (3 = fully automatic, 4 = single column of text, 6 = single uniform block). |
ocr_default_oem |
3 |
OCR engine mode (3 = default, LSTM if available). |
ocr_tesseract_binary |
tesseract |
Path to the binary. Override when Tesseract is installed somewhere unusual. |
When an information object has a known language (information_object_i18n.language),
that language wins over the global default. Heratio maps ISO codes like
af, zu, pt onto the right Tesseract packs automatically.
3. Enable LLM post-correction (optional)
Two settings:
| Key | Default | Meaning |
|---|---|---|
ocr_llm_correction_enabled |
0 |
Set to 1 to run the LLM pass after Tesseract. |
ocr_llm_correction_min_confidence |
70 |
Only post-correct pages whose Tesseract mean confidence falls below this number. High-confidence pages are kept untouched to save tokens. |
The LLM model is whichever provider is configured as default in
ahg_llm_config (Ollama, OpenAI, Anthropic, or your AHG cloud
gateway). The prompt is conservative: it only fixes OCR-typical
patterns and refuses to paraphrase. Every correction the model makes
is logged into security_audit_log and into a preservation_event
row of type ocr.llm_correction.
4. OCR a single page from the command line
For debugging / spot-checking:
php artisan ahg:ocr:page /mnt/nas/heratio/archive/page-001.tif
php artisan ahg:ocr:page page.jpg --lang=eng+afr --psm=4
php artisan ahg:ocr:page page.jpg --io=12345 --do=4242 --persist
php artisan ahg:ocr:page page.jpg --llm-correct --json
Flags:
--lang=Tesseract spec (e.g.eng+afr,osd+nld)--psm=/--oem=per-run overrides--io=/--do=link the PREMIS event to an information_object / digital_object--persistwrite the result intoiiif_ocr_text+iiif_ocr_blockso ALTO / hOCR / PAGE-XML export works for that IO--llm-correctforce the LLM pass on for this run even if the global setting is off
5. PREMIS chain of custody
Every OCR run and every LLM correction shows up in the per-IO PREMIS export. To see the audit trail for a record, open it in Heratio and go to Preservation -> PREMIS events, or run:
SELECT event_type, event_outcome, event_detail, linking_agent_value, event_datetime
FROM preservation_event
WHERE information_object_id = <io_id>
AND event_type LIKE 'ocr.%'
ORDER BY event_datetime DESC;
You will see one row per OCR run (ocr.tesseract) and one row per LLM
post-correction (ocr.llm_correction).