On February 18, 2025, Joost Reeder will defend his Master's Thesis on improving OCR.
Abstract
Optical Character Recognition (OCR) is widely used in bureaucratic digitization for the recognition of scanned forms, and in historical research to digitize paper sources for preservation. The analysis of large corpora is often only practical in digital form, so it is important that the digital copies contain as few errors as possible. Artificial Neural Networks are trained to find text in an image and to recognize characters as well as possible, which can be difficult in cases of bad handwriting or poor resolution. There is a long history of research into improving OCR results by various means. One of these is the use of Language Models, which can be trained to reproduce statistical character or word frequencies. Recent advances in the transformer architecture have introduced Large Language Models (LLMs), which are sophisticated enough to generate whole pages of correct language, making them a suitable candidate for text correction. As LLMs become more readily available and feasible for a wider audience to use, the question arises whether they can be used to optimize OCR output. Current transformer-based approaches include prompting LLMs, letting LLMs vote on different OCR outputs, or fine-tuning LLMs for OCR post-correction.

This thesis combines the prompting approach with OCR probabilities to produce corrected text. The OCR probabilities are incorporated into the token selection process of the LLM. This mitigates the problematic hallucinations of the pure prompting approach while still allowing context-based corrections that no OCR output alone would propose. The right trade-off between LLM token probability and OCR probability must be determined experimentally.

The focus is on coherent paragraphs of handwritten text, as opposed to handwritten notes, machine-written text, or form data: the availability of paragraph context plays to the strength of LLMs, which take context into account when generating tokens. As recent work has observed a correlation between prompt and performance, prompt engineering will be considered, and experiments will determine the best type of prompt. The results will be evaluated using a parameterized character error rate (CER) criterion that allows the reading order to be ignored and different segmentations of the read lines to be used.
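The following is a minimal sketch of the token-selection fusion described above, assuming per-token OCR probabilities are available at each decoding step. The weighting parameter alpha, the helper ocr_log_prob, and the candidate dictionaries are hypothetical illustrations, not the thesis implementation.

# Sketch: combining LLM token log-probabilities with OCR probabilities
# during decoding. Names and the weighting scheme are assumptions.
import math

def ocr_log_prob(token, ocr_candidates):
    """Log-probability the OCR engine assigns to `token`. A small floor
    keeps tokens the OCR never proposed from being ruled out entirely,
    so the LLM can still make context-based corrections."""
    return math.log(ocr_candidates.get(token, 1e-6))

def select_token(llm_log_probs, ocr_candidates, alpha=0.5):
    """Pick the token maximizing a weighted combination of both scores.
    alpha = 1.0 trusts the LLM alone (risking hallucination);
    alpha = 0.0 reproduces the raw OCR reading. The right value is the
    trade-off to be found experimentally."""
    return max(
        llm_log_probs,
        key=lambda tok: alpha * llm_log_probs[tok]
                        + (1 - alpha) * ocr_log_prob(tok, ocr_candidates),
    )

# Example: the OCR favors "modem", but paragraph context makes the LLM
# prefer "modern"; alpha controls which source wins.
llm = {"modern": math.log(0.7), "modem": math.log(0.1)}
ocr = {"modem": 0.8, "modern": 0.15}
print(select_token(llm, ocr, alpha=0.5))  # -> "modern"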