TY - GEN
T1 - The Old Bailey and OCR
T2 - 20th ACM Symposium on Document Engineering, DocEng 2020
AU - Ughetta, William
AU - Kernighan, Brian W.
N1 - Funding Information: The authors are grateful for help and advice from David Brailsford, Zoe LeBlanc, Sharon Howard, Tim Hitchcock, Mikki Hornstein, Chris Miller, Rasool Tyler, and Jack Brassil, and for funding from Princeton SEAS and GCP credits. Publisher Copyright: © 2020 ACM.
PY - 2020/9/29
Y1 - 2020/9/29
N2 - The Proceedings of the Old Bailey is a corpus of over 180,000 page images of court records printed from April 1674 to April 1913 and presents a comprehensive challenge for Optical Character Recognition (OCR) services. The Old Bailey is an ideal benchmark for historical document OCR, representing more than two centuries of variations in documents, including spellings, formats, and printing and preservation qualities. In addition to its historical and sociological significance, the Old Bailey is filled with imperfections that reflect the reality of coping with large-scale historical data. Most importantly, the Old Bailey contains human transcriptions for each page, which can be used to help measure OCR accuracy. Since humans do make mistakes in transcriptions, the relative performance of OCR services will be more informative than their absolute performance. This paper compares three leading commercial OCR cloud services: Amazon Web Services's Textract (AWS); Microsoft Azure's Cognitive Services (Azure); and Google Cloud Platform's Vision (GCP). Benchmarking involved downloading over 180,000 images, executing the OCR, and measuring the error rate of the OCR text against the human transcriptions. Our results found that AWS had the lowest median error rate, Azure had the lowest median round trip time, and GCP had the best combination of a low error rate and a low duration.
KW - Amazon Web Services
KW - Google Cloud Platform
KW - Historical Documents
KW - Microsoft Azure
KW - Old Bailey
KW - Optical Character Recognition
UR - http://www.scopus.com/inward/record.url?scp=85093095311&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093095311&partnerID=8YFLogxK
U2 - 10.1145/3395027.3419595
DO - 10.1145/3395027.3419595
M3 - Conference contribution
T3 - Proceedings of the ACM Symposium on Document Engineering, DocEng 2020
BT - Proceedings of the ACM Symposium on Document Engineering, DocEng 2020
PB - Association for Computing Machinery, Inc
Y2 - 29 September 2020 through 1 October 2020
ER -