Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

Recovering 34 Billion Latin Words from AI Training Data: Or Philology’s Collaborative Demands at Computational Scale

Abstract for a paper delivered at SCS2026.
Written with D. Bamman, C. Brooks, M. Hudspeth, and B. O’Connor.

Keywords

Latin philology, computational thinking, large language models

Abstract

When researchers in artificial intelligence (Langlais, Stasenko, and Arnett 2024) release a text repository advertising 34 billion Latin tokens, a figure more than 5,000 times larger than a comprehensive repository of canonically classical Latin such as the Perseus Digital Library, how are philologists supposed to assess the contents of such a collection? And how are Latinists expected to use it? The number is so large relative to other Latin collections that, as we argue in this talk, even finding an entry point into the question requires collaboration with colleagues in computer science and information science (Crane et al. 2014). This talk describes the Latin content of an LLM training-data repository, with particular attention to how we navigate its nearly 150GB of files and how we handle widespread textual corruption, high rates of duplication, and other problems that arise at this scale. Researchers have shown that, despite such concerns, similarly large data can be used to train state-of-the-art Latin language models (Bamman/Burns 2020; Riemenschneider/Frank 2023; Hudspeth, Burns, and O’Connor 2025). Researchers have also shown the basic value of large-scale quantitative description and assessment of available Latin textual resources (Bamman/Smith 2012; Burns 2023; Hudspeth, O’Connor, and Thompson 2024). Still, more exploration is needed to understand the trade-off between quantity and quality for the language. Accordingly, we discuss ways to filter out low-quality, “noisy” texts and entertain ideas about at-scale OCR correction and related computational mitigations for the remaining texts (Smith/Cordell 2023; Cowen-Breen et al. 2023), as well as prospects for enriching these corpora with metadata (e.g. author, genre, or historical period) that could support deeper philological investigation.
In undertaking this study, we further take the following interdisciplinary position: working with billions of Latin words is an intellectual endeavor that requires both philological method and computational method, philological thinking and computational thinking (Wing 2006). This position speaks to the broader need to approach machine learning data collection through a sociocultural archival lens (Jo/Gebru 2020; Desai et al. 2024), joining other work on characterizing implicit or undocumented data curation decisions behind web-based LLM training data and available models (Dodge et al. 2021; Soldaini et al. 2024). In this respect, the talk sets an agenda for computational philology in our current LLM-focused environment.
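To make the kind of filtering and deduplication discussed above concrete, here is a minimal sketch. Everything in it is illustrative rather than a description of our actual pipeline: the function names, the tiny stopword list, and the 0.5 threshold are assumptions chosen for the example. It pairs a crude quality heuristic (alphabetic-character ratio plus Latin function-word hit rate) with exact-hash deduplication over whitespace-normalized text.

```python
import hashlib
import re

# A handful of high-frequency Latin function words; their presence is a
# weak but cheap signal that a document is genuine Latin rather than
# OCR noise or boilerplate. (Illustrative list, not a curated lexicon.)
LATIN_STOPWORDS = {"et", "in", "est", "non", "ad", "ut", "cum", "sed", "quod", "qui"}

def quality_score(text):
    """Crude quality heuristic in [0, 1]: blend of the share of
    alphabetic/whitespace characters and the stopword hit rate."""
    if not text.strip():
        return 0.0
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    if not tokens:
        return 0.0
    stop_ratio = sum(t in LATIN_STOPWORDS for t in tokens) / len(tokens)
    # Scale the stopword signal so that ~20% function words saturates it.
    return 0.5 * alpha_ratio + 0.5 * min(stop_ratio * 5, 1.0)

def dedupe_and_filter(docs, threshold=0.5):
    """Drop exact duplicates (hash of whitespace-normalized, lowercased
    text) and documents scoring below the quality threshold."""
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha1(" ".join(doc.split()).lower().encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        if quality_score(doc) >= threshold:
            kept.append(doc)
    return kept
```

Real at-scale pipelines would replace the exact hash with near-duplicate detection (e.g. MinHash over shingles) and the stopword heuristic with a trained language-identification or quality model, but the two-stage shape (normalize and deduplicate, then score and filter) is the same.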
