Handwriting Transcription Using Word Spotting and Humans in the Loop
Brandon Dang
Master's Report/Thesis · Spring 2018
Handwritten materials such as historical document collections are increasingly being digitized and made available for preservation, scholarly analysis, and text retrieval. To this end, numerous specialized software tools have been developed to support the crowdsourced transcription of such texts. However, because many of these tools operate at the page level, they may not be suitable for documents containing privacy-sensitive data such as medical records: showing a full page to a worker risks disclosing that information to unintended parties. Manual transcription can also be slow and expensive. Automatic optical character recognition (OCR) methods perform poorly on handwritten text because of the large variability in human handwriting, degradation of historical texts, scanning artifacts, and other sources of noise. As such, handwritten text analysis remains an active area of research. With the renewed interest in neural networks, recent deep learning methods have achieved state-of-the-art results on benchmark datasets in areas including word recognition, word spotting, and character recognition. Despite this, current methods are not yet robust enough to fully automate handwriting transcription on their own. In this work, we aim to combine the efficiency of machine learning with the accuracy of human intelligence to semi-automatically transcribe a challenging real-world dataset of word images segmented from historical handwritten medical records as part of the Central State Hospital Digital Library project. Specifically, we use a deep convolutional network to generate a feature set, identify groups of similar images using unsupervised density-based clustering, and obtain cluster transcriptions from human workers on an online crowdsourcing platform.
In doing so, we aim to reduce the number of images to be sent to the crowd, thereby optimizing monetary and time costs while still maintaining an acceptable level of accuracy and preserving the privacy of the data.
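The cluster-then-transcribe idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the implementation used in this work: the 2-D tuples stand in for the much higher-dimensional features a convolutional network would produce, DBSCAN is used as one example of density-based clustering, the `eps` and `min_pts` values are arbitrary, and the "crowd answers" are made-up strings in place of real worker responses.

```python
from math import dist

NOISE = -1  # label for images that belong to no dense cluster


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # core point: keep growing the cluster
        cluster += 1
    return labels


def images_to_send(labels):
    """Pick one representative per cluster, plus every unclustered image."""
    seen, send = set(), []
    for i, lab in enumerate(labels):
        if lab == NOISE:
            send.append(i)
        elif lab not in seen:
            seen.add(lab)
            send.append(i)
    return send


def propagate(labels, send, answers):
    """Copy each representative's transcription to the rest of its cluster."""
    by_cluster = {labels[i]: text for i, text in zip(send, answers)
                  if labels[i] != NOISE}
    result = [None] * len(labels)
    for i, text in zip(send, answers):
        result[i] = text
    for i, lab in enumerate(labels):
        if lab != NOISE:
            result[i] = by_cluster[lab]
    return result


# Hypothetical stand-ins for CNN feature vectors of seven word images.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (9.0, 0.0)]
labels = dbscan(features, eps=0.5, min_pts=2)
send = images_to_send(labels)

# Made-up transcriptions standing in for crowd worker responses.
answers = ["fever", "patient", "ward"]
result = propagate(labels, send, answers)
```

In this toy run, only three of the seven images would go to workers, and each worker sees an isolated word rather than a full page. A real pipeline would also need some form of quality control, since a cluster that mixes different words propagates a wrong transcription to every member.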