Digitization and search: A non-traditional use of HPC

Abstract

Automated search of handwritten content is a problem of considerable practical interest, made more pressing by the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, which remains a very difficult task, we use a content-based image retrieval technique known as Word Spotting. Under this paradigm, the system is queried with images of handwritten text rather than ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to pre-process the data so that this search capability can be made available on an archive. The required preprocessing steps and the open source …

Date: October 8, 2012
Authors: Liana Diesendruck, Luigi Marini, Rob Kooper, Mayank Kejriwal, Kenton McHenry
Conference: 2012 IEEE 8th International Conference on E-Science
Pages: 1-6
Publisher: IEEE
