
Topic based language models for OCR correction

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. In the absence of such a lexicon, however, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic-based language model. We construct a topic-based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and is used to generate the topic distribution of a test document. A given test word image is processed by the recognizer and its word recognition likelihood is refined by …
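
The abstract is cut off before it says how the recognition likelihood is refined, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes add-alpha smoothed unigram models per topic, a topic distribution supplied directly (in the paper it comes from the Maximum Entropy categorizer, which is not reproduced here), and a simple product of recognizer likelihood and topic-mixture word probability. All function names, parameters, and data are hypothetical.

```python
from collections import Counter


def train_topic_unigrams(docs_by_topic, alpha=1.0):
    """Estimate add-alpha smoothed P(word | topic) from categorized documents.

    docs_by_topic maps a topic label to a list of tokenized documents.
    (Hypothetical helper; smoothing choice is an assumption.)
    """
    vocab = {w for docs in docs_by_topic.values() for doc in docs for w in doc}
    models = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        models[topic] = {w: (counts[w] + alpha) / total for w in vocab}
    return models


def mixture_prob(word, topic_dist, models, floor=1e-6):
    """P(word | document) as a topic mixture: sum_t P(t | doc) * P(word | t)."""
    return sum(p_t * models[t].get(word, floor) for t, p_t in topic_dist.items())


def rescore(candidates, topic_dist, models):
    """Combine recognizer likelihoods with the topic-mixture prior.

    candidates is a list of (word, recognizer_likelihood) pairs.
    Assumed combination rule: normalized product of the two scores.
    """
    scored = [(w, lik * mixture_prob(w, topic_dist, models)) for w, lik in candidates]
    z = sum(s for _, s in scored)
    return sorted(((w, s / z) for w, s in scored), key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Toy categorized training data (made up for illustration).
    training = {
        "sports": [["game", "team", "score"]],
        "finance": [["bank", "loan", "score"]],
    }
    models = train_topic_unigrams(training)
    # Topic distribution of the test document, assumed given.
    topic_dist = {"sports": 0.9, "finance": 0.1}
    # The recognizer slightly prefers "bank"; the topic prior favors "game".
    print(rescore([("bank", 0.55), ("game", 0.45)], topic_dist, models))
```

In this toy setup the recognizer's top choice is overturned because the document is mostly about sports; a real system would need a much larger vocabulary and a principled way to weight the two scores.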

Date: July 24, 2008
Authors: Anurag Bhardwaj, Faisal Farooq, Huaigu Cao, Venu Govindaraju
Book: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
Pages: 107-112
