Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

Intertextuality and Latin Language Models, or a Grammar of Subword Allusion

Paper presented at Classical Texts in Digital Media, University of Patras, Patras, Greece.

Abstract

In his monograph Repetition in Latin Poetry, Jeffrey Wills notes that diction, and repeated diction specifically, has been a focus of Latin literary criticism since at least Servius, and goes on to argue that critics should also pay attention to a larger “grammar of allusion,” that is, an accounting of the full range of features that “[enable] a poet to signal reference in a multitude of ways” (1996: 31). Computational innovations in working with digitized Latin texts now offer philologists ever-expanding methods for extracting and accounting for textual features beyond word-to-word comparisons (Forstall/Scheirer 2019). Language models such as fastText or BERT (Bojanowski et al. 2017; Devlin et al. 2019) work with a “vocabulary” not of words or wordforms, but of probabilistically disassembled word parts—that is, subwords—that can be combined and recombined as necessary for various text analysis tasks. These subwords can be represented as dense numerical vectors that account for their contextual relationships to other subwords better than individual words can. This paper takes up Wills’ challenge of a non-diction-based approach to allusion in Latin poetry by computing line-level subword vector similarity as a measure of how intertextual two hexameters are. This similarity can be computed regardless of how many words two lines share, or whether they have overlapping diction at all. I demonstrate here TessAnnoy, a vector index of verse lines from a large collection of Latin hexameter texts that can effectively return “nearest neighbor” lines, that is, the other lines in the collection with the most similar mean vectors. I compare four vector indices—word vectors, character vectors, static subword vectors, and contextual subword vectors—built using LatinCy (Burns 2023), a collection of Latin language models I have trained for use with the spaCy platform (Honnibal/Montani 2023), and indexed with Annoy (Bernhardsson 2023). Based on this comparison, I argue that, while word vectors offer a robust extension of purely diction-based methods (those most thoroughly studied in computational Latin literary criticism, for example with Tesserae [Coffee et al. 2012]), subword vectors fall better within Wills’ more abstract and more expansive definition of an allusive grammar. Accordingly, subword indices capture in a systematic and comprehensive manner what Gian Biagio Conte (1986: 31) referred to as a “series of phenomena that could be otherwise registered only piecemeal, in uncoordinated, discrete details,” that is, the “code model,” his non-diction-based framework for text-to-text relationships in Latin literary criticism. Subword vector similarity is in this respect a powerful tool in the Latin philologist’s toolbox, making measurable and comparable textual (and intertextual) details that were previously available only to human-scale philology and individual reading experience.
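To make the approach concrete, here is a minimal sketch of a TessAnnoy-style index in Python. It is not the paper's implementation: it assumes the LatinCy model name `la_core_web_lg`, uses spaCy's `Doc.vector` (the mean of a line's token vectors) as a stand-in for the four vector types compared in the paper, and indexes three lines of Aeneid 1 in place of the full hexameter collection.

```python
# Minimal sketch of a TessAnnoy-style nearest-neighbor index over Latin
# hexameter lines. Assumes the LatinCy model "la_core_web_lg" and the
# Annoy library (pip install annoy) are installed; not the paper's code.
import spacy
from annoy import AnnoyIndex

# Stand-in corpus (Aeneid 1.1-3); in practice, every line from a large
# collection of Latin hexameter texts would be indexed.
lines = [
    "arma virumque cano Troiae qui primus ab oris",
    "Italiam fato profugus Laviniaque venit",
    "litora multum ille et terris iactatus et alto",
]

nlp = spacy.load("la_core_web_lg")   # LatinCy pipeline
dim = nlp.vocab.vectors_length       # dimensionality of the model's vectors
index = AnnoyIndex(dim, "angular")   # angular distance approximates cosine

for i, line in enumerate(lines):
    # Doc.vector is the mean of the line's token vectors; the paper compares
    # word-, character-, and subword-level variants of this representation.
    index.add_item(i, nlp(line).vector)

index.build(50)  # 50 trees; more trees trade build time for recall

# Nearest neighbors of line 0, with no requirement of shared diction.
ids, dists = index.get_nns_by_item(0, 3, include_distances=True)
for j, d in zip(ids, dists):
    print(f"{d:.3f}  {lines[j]}")
```

Angular distance is a standard proxy for cosine similarity when comparing mean vectors; for a query line outside the indexed collection, Annoy's `get_nns_by_vector` serves the same purpose.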

Works Cited

Bernhardsson, E. 2023. Annoy: Approximate Nearest Neighbors Oh Yeah. https://github.com/spotify/annoy.

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5: 135–146.

Burns, P. J. 2023. “LatinCy: Synthetic Trained Pipelines for Latin NLP.” arXiv:2305.04365.

Coffee, N., J.-P. Koenig, S. Poornima, R. Ossewaarde, C. Forstall, and S. Jacobson. 2012. “Intertextuality in the Digital Age.” Transactions of the American Philological Association 142 (2): 383–422.

Conte, G. B. 1986. The Rhetoric of Imitation: Genre and Poetic Memory in Virgil and Other Latin Poets. Ithaca: Cornell University Press.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of NAACL-HLT 2019, 4171–4186.

Forstall, C. W., and W. J. Scheirer. 2019. Quantitative Intertextuality: Analyzing the Markers of Information Reuse. Cham: Springer.

Honnibal, M., and I. Montani. 2023. spaCy: Industrial-Strength Natural Language Processing in Python. https://spacy.io.

Wills, J. 1996. Repetition in Latin Poetry: Figures of Allusion. Oxford: Clarendon Press.
