Open Access, BY-NC-ND 4.0 license. Published by De Gruyter Saur, 2019.

Optical Character Recognition for Classical Philology

Bruce Robertson

Abstract

This paper explains the technology behind recent improvements in optical character recognition (OCR) and how it can be tuned to produce highly accurate texts of scholarly value, especially for difficult scripts such as ancient Greek. Drawing on several practical experiments with the Ciaconna OCR system (itself based on OCRopus), it shows the impact of Unicode normalization forms on recognition accuracy; the importance of removing ambiguously encoded characters from training material; the advantage of using separate classifiers for different scripts; the helpful effects of image augmentation; and the effects of binarization levels. It also describes how Ciaconna embeds spell-check and dehyphenation information in its output.
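
To make the point about Unicode normalization concrete, the following is a minimal sketch, not taken from Ciaconna or OCRopus, of how one polytonic Greek letter can be encoded in different ways. It uses only Python's standard `unicodedata` module; the helper names `normalized_forms` and `codepoint_names` are hypothetical. It assumes that "ambiguously encoded characters" includes pairs such as alpha with oxia (U+1F71) and alpha with tonos (U+03AC), which Unicode treats as canonically equivalent; the chapter itself may have other cases in mind.

```python
import unicodedata

def normalized_forms(text):
    """Return the NFC (composed) and NFD (decomposed) forms of text."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFD")}

def codepoint_names(text):
    """List the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# Alpha with smooth breathing and acute accent, as a single precomposed
# code point (U+1F04).
sample = "\u1f04"
forms = normalized_forms(sample)

# NFC keeps one code point; NFD splits it into a base letter plus two
# combining marks, so a classifier trained on one form sees different
# target sequences than one trained on the other.
print(len(forms["NFC"]), codepoint_names(forms["NFC"]))
print(len(forms["NFD"]), codepoint_names(forms["NFD"]))

# Illustrative assumption: U+1F71 (alpha with oxia) and U+03AC (alpha with
# tonos) look alike in print; normalization maps the former to the latter.
print(unicodedata.normalize("NFC", "\u1f71") == "\u03ac")  # True
```

Applying one normalization form consistently to ground-truth transcriptions before training is one way to keep visually identical glyphs from being mapped to different code point sequences.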

© 2019 Walter de Gruyter GmbH, Berlin/Munich/Boston
Downloaded on 5.12.2022 from frontend.live.degruyter.dgbricks.com/document/doi/10.1515/9783110599572-008/html