Patrick J. Burns
Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer
The Digital Afterlife of a Dead Language: Or Recovering 34 Billion(!) Latin Words from AI Training Data
Abstract for a public lecture at the Taft Center for Humanities, University of Cincinnati
Abstract
Latin has been a perhaps unexpected beneficiary of recently published Large Language Model (LLM) training datasets. For example, an artificial intelligence firm just released a text repository advertising 34 billion Latin tokens, a collection more than 5,000 times the size of a comprehensive repository of canonically classical Latin such as the Perseus Digital Library. The number is so strikingly large relative to other Latin collections ("unfathomable," in the parlance of AI critique) that it demands a fuller accounting of what it means for humanities scholars to work with such collections. It leads us to ask: What novel methods are necessary to explore such a library? How do we handle the massive amount of textual corruption found in these volumes? What tools and models can we build, and build responsibly, with that amount of textual data? In this talk, I will draw together threads from natural language processing, cryptography, and textual criticism, among other disciplines, to redefine philology at scale for our computational, LLM-inflected moment. While the presentation will lead with examples from Latin texts, it invites humanities scholars working in any language or literature to reflect on how issues of training-data quantity and quality affect their own areas of research.