Publications
Topic based language models for OCR correction
Abstract
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. In the absence of such a lexicon, however, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic-based language model. We construct a topic-based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and used to generate the topic distribution of a test document. A given test word image is processed by the recognizer and its word recognition likelihood is refined by …
- Date: July 24, 2008
- Authors: Anurag Bhardwaj, Faisal Farooq, Huaigu Cao, Venu Govindaraju
- Book: Proceedings of the second workshop on Analytics for noisy unstructured text data
- Pages: 107-112
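
The abstract is cut off before it says how the recognition likelihood is refined, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes one smoothed unigram model per topic, a topic distribution supplied as a plain dictionary (standing in for the Maximum Entropy categorizer's output), and a log-linear combination of recognizer likelihood and language model probability. All function names, the smoothing scheme, and the `lm_weight` parameter are hypothetical.

```python
import math
from collections import Counter


def train_topic_unigrams(labeled_docs):
    """Count words per topic from (topic, tokens) pairs of categorized training data."""
    counts = {}
    vocab = set()
    for topic, tokens in labeled_docs:
        counts.setdefault(topic, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, len(vocab)


def topic_word_prob(word, topic, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed unigram probability; one extra slot is reserved for unseen words."""
    c = counts[topic]
    total = sum(c.values())
    return (c[word] + alpha) / (total + alpha * (vocab_size + 1))


def mixture_prob(word, topic_dist, counts, vocab_size):
    """Mix the per-topic models using the document's topic distribution."""
    return sum(
        p * topic_word_prob(word, t, counts, vocab_size)
        for t, p in topic_dist.items()
    )


def rescore(candidates, topic_dist, counts, vocab_size, lm_weight=1.0):
    """Re-rank (word, recognizer_likelihood) pairs.

    Assumed scoring rule (not taken from the paper):
    log(likelihood) + lm_weight * log(topic-mixture probability).
    """
    scored = [
        (w, math.log(lik) + lm_weight * math.log(mixture_prob(w, topic_dist, counts, vocab_size)))
        for w, lik in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
training = [
    ("medical", ["patient", "pain", "chest", "patient"]),
    ("legal", ["court", "plaintiff", "patent"]),
]
counts, vocab_size = train_topic_unigrams(training)

# Stand-in for the topic categorizer's output on a test document.
topic_dist = {"medical": 0.9, "legal": 0.1}

# The recognizer slightly prefers the wrong word before rescoring.
candidates = [("patent", 0.40), ("patient", 0.35)]
ranking = rescore(candidates, topic_dist, counts, vocab_size)
best_word = ranking[0][0]
```

In this toy case the recognizer ranks "patent" first, but the mostly medical topic distribution lifts "patient" to the top after rescoring. A real system would replace the hand-written `topic_dist` with the output of a trained classifier and would likely tune `lm_weight` on held-out data.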