Patrick J. Burns
Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer
Rebuilding the Library of Al3xandr!a: Latin Post-OCR Correction as a Philological Task
Abstract for a forthcoming conference
Abstract
The “unfathomable” training data behind large language models (LLMs) contains a similarly unfathomable amount of Latin text, much of which derives from scanned volumes of Neo-Latin text. The recently released common_corpus from Pleias advertises 34 billion Latin tokens (Langlais, Stasenko, and Arnett 2024), a figure over 5,000 times larger than the Perseus Digital Library and over 100 times larger than the Corpus Corporum. On closer inspection, however, because the quality of the underlying optical character recognition (OCR) varies so widely, we are as likely to find running text like eadem dicta esse repertum sit in this collection as we are to find fim ego exf€r$mJkm)iMdem (Bamman n.d.). This paper will analyze the Latin (and Latin-ish) content of common_corpus with an eye toward the degree to which LLM-assisted post-OCR correction (cf. e.g. Thomas, Gaizauskas, and Lu 2024) can recover corrupted Latin text from these modern-day Libraries of Alexandria (Kahle 2021). If, as James Zetzel (2015) writes, a defining activity of philology is “reconstructing what [was] written rather than enshrining or embalming the errors transmitted,” then post-OCR correction may turn out to be, I argue here, a defining (computational) philological activity for Latin in the LLM era.
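To make the planned corpus survey concrete, the sketch below shows one way such an analysis could begin: streaming Common Corpus from Hugging Face, filtering for records tagged as Latin, and drafting a post-OCR correction prompt for an LLM. This is a minimal illustration rather than the paper's own pipeline; the dataset identifier PleIAs/common_corpus, the language/text field names, and the language tag values are assumptions to be checked against the dataset card.

```python
# Minimal sketch: stream Common Corpus, pull documents tagged as Latin, and
# build a post-OCR correction prompt. Field names and tag values are assumed,
# not confirmed against the dataset card.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte corpus up front.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

def latin_documents(dataset, limit=100):
    """Yield up to `limit` texts whose language tag looks like Latin."""
    count = 0
    for record in dataset:
        # "language" and its possible values ("Latin", "la", "lat") are assumptions.
        if record.get("language") in {"Latin", "la", "lat"}:
            yield record.get("text", "")
            count += 1
            if count >= limit:
                break

CORRECTION_PROMPT = (
    "The following passage is OCR output from a scanned Latin volume. "
    "Reconstruct the most plausible original Latin, correcting OCR errors "
    "without modernizing spelling or adding new text:\n\n{passage}"
)

for passage in latin_documents(corpus, limit=3):
    prompt = CORRECTION_PROMPT.format(passage=passage[:500])
    # Send `prompt` to an LLM of your choice; the response is a candidate
    # post-OCR correction to be judged against philological criteria.
    print(prompt[:200], "...")
```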
Works Cited
- Bamman, D. n.d. “11K Latin Texts.” http://www.cs.cmu.edu/~dbamman/latin.html.
- Kahle, B. 2021. “I Set Out to Build the Next Library of Alexandria. Now I Wonder: Will There Be Libraries in 25 Years?” Time. Oct. 22. https://time.com/6108581/internet-archive-future-books/.
- Langlais, P.-C., Stasenko, A., and Arnett, C. 2024. “Releasing the Largest Multilingual Open Pretraining Dataset.” Hugging Face. Nov. 13. https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open.
- Thomas, A., Gaizauskas, R., and Lu, H. 2024. “Leveraging LLMs for Post-OCR Correction of Historical Newspapers.” In Sprugnoli, R. and Passarotti, M. eds. Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024: 116–21. https://aclanthology.org/2024.lt4hala-1.14.
- Zetzel, J.E.G. 2015. “The Bride of Mercury: Confessions of a ’Pataphilologist.” In Pollock, S. ed. World Philology. Cambridge, MA: Harvard University Press. 45–62.