Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

Neo-Latin as a pragmatic source of language model training data

Abstract for IANLS2025 in Aix-en-Provence, France. July 2025.

Abstract

A decade ago, David Bamman (Bamman n.d.) showed the effectiveness of using 11K texts from the Internet Archive to train Latin language models despite the low quality of their optical character recognition, a result replicated more recently in the training of the first transformer-based Latin model (Bamman and Burns 2020). Notable about these texts is how many of them date from after the 15th century, that is, from “the Italian Renaissance up to the modern day” (Butterfield 2012). As such, Neo-Latin texts represent the only pragmatic source of “big” data for training the latest generation of large language models (LLMs) for the language. In this paper, I present a project extending Bamman’s earlier efforts in order to train ever-larger Latin models: OMNIA (Omnis Materia Nominata apud Internet Archive). The project has two phases: 1. building a repository of what is now roughly 106K Latin IA texts; and 2. using an open-source LLM to correct the collection’s low-quality OCR. When complete, OMNIA, with word counts perhaps measuring in the billions, will be among the largest plaintext Latin collections and will include substantial Neo-Latin coverage. Moreover, since OMNIA can be used to train models for downstream text analysis tasks, it should stand as a major milestone in advancing Latin natural language processing for all periods.
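As a rough illustration of the first phase, the sketch below shows how one might query the Internet Archive for Latin-language plaintext volumes using the open-source `internetarchive` Python client. The query string, file pattern, and output directory are assumptions for the sake of example, not a description of OMNIA's actual pipeline.

```python
# Minimal sketch: harvesting candidate Latin plaintext volumes from the Internet Archive.
# Assumes the open-source `internetarchive` client (pip install internetarchive);
# the query, glob pattern, and destdir are illustrative guesses, not OMNIA's pipeline.
from internetarchive import search_items, download

# Search IA metadata for items catalogued as Latin-language texts.
query = "language:(lat OR latin) AND mediatype:texts"

identifiers = [result["identifier"] for result in search_items(query)]
print(f"Found {len(identifiers)} candidate Latin items")

# Download the plain-text OCR derivative for a small sample of items.
for identifier in identifiers[:10]:
    download(
        identifier,
        glob_pattern="*_djvu.txt",   # IA's OCR plaintext derivative
        destdir="omnia_raw",         # hypothetical local output directory
        ignore_existing=True,
    )
```

The resulting plaintext files would then be candidates for the second phase, LLM-based post-correction of the low-quality OCR.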

Works Cited
