
Laura Vangone
Post-doctoral researcher at the University of Bologna
Related Authors
Tommaso Manzon
International Baptist Theological Study Centre Amsterdam
Alejandro García Morilla
Universidad Complutense de Madrid
Antonin Charrié-Benoist
Ecole Pratique des Hautes Etudes
Carlo Ebanista
Università del Molise
Clelia Crialesi
Centre National de la Recherche Scientifique / French National Centre for Scientific Research
Agostino Paravicini Bagliani
University of Lausanne
Stefano Ghiroldi
Università degli Studi di Bergamo (University of Bergamo)
Gianmarco De Angelis
Università degli Studi di Padova
Noemi Pigini
Consiglio Nazionale delle Ricerche (CNR)
Simone Marcenaro
Università del Molise
Uploads
Papers by Laura Vangone
In the first stage of the project, which started in mid-2018, we are selecting the texts to be included in the corpus, based on the metadata in the electronic database of Medieval Latin texts that is currently the largest scholarly-curated source of information of this kind freely available on the Internet. Once selected, the texts are retrieved from existing collections and digital libraries.
As early tests showed, less than half of the texts already exist in interoperable formats such as TEI XML, or at least in a form that allows for easy conversion without human intervention. This means that the bulk of the corpus texts has to be acquired through OCR and post-processing of digital images of editions available online. For both tasks a broad range of efficient tools now exists, and many sophisticated workflows have been proposed in the literature. However, the project presented here is significantly limited in its resources, since a single person is expected to control the process and improve OCR quality within a single year.
In the presentation we would like, first, to demonstrate the workflow of the project, which at the moment consists of 1) image extraction from PDF files, 2) image cleaning, 3) OCR, 4) batch correction of OCR errors, and 5) removal of non-Latin text with a simple classifier. The tools we use are all free and open source, an important factor in a project that is low on resources but ambitious in its goals. The PDF extraction and conversion are performed with the Linux 'pdfimages' and 'convert' commands. The output TIFFs are cleaned with ScanTailor, while the OCR is performed with Tesseract. To save time, the entire workflow is automated, with a human analyst verifying the quality of the output and mass-correcting OCR errors with the Post Correction Tool.
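As a concrete illustration of how such a pipeline can be scripted, the following Python sketch chains the steps described above by calling the 'pdfimages', 'convert', and 'tesseract' command-line tools. The ImageMagick cleaning flags, the Tesseract language model, and the stopword-based Latin filter are assumptions made here for illustration only; they are not the project's actual configuration, and the interactive tools (ScanTailor, the Post Correction Tool) are not reproduced.

```python
# Minimal sketch of the acquisition workflow, assuming the Poppler 'pdfimages',
# ImageMagick 'convert', and 'tesseract' binaries are installed and on PATH.
# Paths, flags, and the stopword list are illustrative, not project settings.
import subprocess
from pathlib import Path

# Tiny list of very frequent Latin function words used by the toy filter below.
LATIN_STOPWORDS = {"et", "in", "ad", "non", "est", "cum", "qui", "quod", "ut", "sed"}

def extract_images(pdf_path: Path, out_dir: Path) -> None:
    """1) Extract page images from a PDF as TIFFs (pdfimages -tiff)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pdfimages", "-tiff", str(pdf_path), str(out_dir / "page")],
                   check=True)

def clean_image(tiff_path: Path) -> Path:
    """2) Basic cleaning (deskew, binarise). ScanTailor does this interactively;
    ImageMagick 'convert' stands in here as a scriptable approximation."""
    cleaned = tiff_path.with_name(tiff_path.stem + "_clean.tif")
    subprocess.run(["convert", str(tiff_path), "-deskew", "40%",
                    "-colorspace", "Gray", "-threshold", "60%", str(cleaned)],
                   check=True)
    return cleaned

def ocr_image(tiff_path: Path) -> str:
    """3) OCR with Tesseract, assuming its Latin model ('-l lat') is installed."""
    result = subprocess.run(["tesseract", str(tiff_path), "stdout", "-l", "lat"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def looks_latin(line: str, threshold: float = 0.1) -> bool:
    """5) Toy classifier: keep a line if enough of its tokens are common Latin words."""
    tokens = [t.lower().strip(".,;:") for t in line.split()]
    if not tokens:
        return False
    hits = sum(t in LATIN_STOPWORDS for t in tokens)
    return hits / len(tokens) >= threshold

def process_pdf(pdf_path: Path, work_dir: Path) -> str:
    """Run steps 1-3 and 5 for one edition; step 4 (batch correction of OCR
    errors) is left to the human analyst and is not modelled here."""
    extract_images(pdf_path, work_dir)
    latin_lines = []
    for tiff in sorted(work_dir.glob("page-*.tif")):
        text = ocr_image(clean_image(tiff))
        latin_lines.extend(l for l in text.splitlines() if looks_latin(l))
    return "\n".join(latin_lines)
```

In such a setup the per-page commands stay replaceable: any of the three external tools can be swapped out without touching the rest of the script, which matters when image quality varies as much as described below.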
Apart from presenting the project and the workflow, the paper will discuss the challenges we have faced. One of the most problematic issues turned out to be the widely varying quality of the image files retrieved from online sources. Another factor that significantly hindered automatic processing was the quality of the text editions themselves.
Medioevo latino. Metodologie e tecniche bibliografiche
Florence, 1 October 2020