Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

Rebuilding the Library of Al3xandr!a: Latin Post-OCR Correction as a Philological Task

Abstract for conference forthcoming

Abstract

The “unfathomable” data used to train large language models (LLMs) contains a similarly unfathomable amount of Latin text, much of which derives from scanned Neo-Latin volumes. The recently released common_corpus from Pleias advertises 34 billion Latin tokens (Langlais, Stasenko, and Arnett 2024), a number over 5000 times larger than the Perseus Digital Library and over 100 times larger than the Corpus Corporum. On closer inspection, however, because the quality of optical character recognition (OCR) varies so widely, we are as likely to find in this collection running text like eadem dicta esse repertum sit as we are to find fim ego exf€r$mJkm)iMdem (Bamman n.d.). This paper will analyze the Latin (and Latin-ish) content of common_corpus with an eye toward the degree to which LLM-assisted post-OCR correction (cf. e.g. Thomas, Gaizauskas, and Lu 2024) can be used to recover corrupted Latin text from these modern-day Libraries of Alexandria (Kahle 2021). If a defining activity of philology can be considered, as James Zetzel (2015) writes, “reconstructing what [was] written rather than enshrining or embalming the errors transmitted,” post-OCR correction may turn out to be, I argue here, a defining (computational) philological activity for Latin in the LLM era.

Works Cited
