Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

How to Read Latin Like a Computer: The Philology of Latin Word Sense Disambiguation

Abstract for a workshop at the TLL on October 20.

Abstract

Cum/with vs. cum/when. Levis/light vs. levis/smooth. Ius/law vs. ius/broth. Homographs present a vocabulary challenge to the emergent Latin reader. They also continue to challenge accurate computational “reading” of the Latin language, affecting natural language processing (NLP) tasks such as lemmatization, part-of-speech tagging, and syntactic parsing. The same is true of word sense disambiguation. As W. G. Hale writes in The Art of Reading Latin about the student experience of seeing the word ut: “How will I translate it? There are some half-dozen or more ‘meanings’: which does it have here?” This is, again, a processing challenge for both the human and the computer.

This presentation, drawn from my work-in-progress book project How to Read Latin Like a Computer, looks at the problem of word sense disambiguation from a philological and lexicographical perspective, inviting such questions as: What strategies do people use to read Latin, and especially to understand ambiguous lexical situations encountered while reading? What strategies do computer models use to “read” (that is, to process) Latin, and in particular to “understand” Latin semantics? And what can we learn about one from the other?

In this talk, I will briefly review important concepts from the history of NLP-driven word sense disambiguation, including the Lesk algorithm, the “one sense per discourse” approach, and corpus-derived clustering, to name a few. I will also cover the state of Latin NLP, with particular attention to distributional approaches to Latin semantics. The goal of the talk is to get feedback from Latin philologists and lexicographers on how a comparative—i.e. human vs. computational—approach to “reading” Latin can be applied to the hundreds of millions, if not billions, of words of Latin available online that have yet to be systematically curated, classified, and catalogued.
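
As a rough illustration of the first of these concepts, here is a minimal Python sketch of a simplified Lesk-style disambiguator applied to the ius/law vs. ius/broth example. It is not the method used in the book project or in any existing Latin NLP tool. The sense "signatures" are hand-picked Latin lemmas invented for this example rather than taken from a dictionary, the function name simplified_lesk is hypothetical, and the context is assumed to be lemmatized already.

```python
# Toy sense inventory: each sense of a homograph maps to a "signature",
# a small set of Latin lemmas expected to appear near that sense.
# These signatures are illustrative assumptions, not dictionary data.
SENSES = {
    "ius": {
        "law": {"lex", "iudex", "civis", "civilis", "populus", "iustitia"},
        "broth": {"coquo", "caro", "olla", "piper", "sal", "condio"},
    },
}


def simplified_lesk(lemma, context_lemmas, senses=SENSES):
    """Pick the sense whose signature shares the most lemmas with the context.

    Returns a (best_sense, overlap_counts) pair. best_sense is None when
    no signature overlaps the context at all.
    """
    # Ignore the target word itself when measuring overlap.
    context = set(context_lemmas) - {lemma}
    overlaps = {
        sense: len(signature & context)
        for sense, signature in senses[lemma].items()
    }
    best = max(overlaps, key=overlaps.get)
    if overlaps[best] == 0:
        return None, overlaps
    return best, overlaps


# Two invented, already-lemmatized contexts for "ius".
legal = ["ius", "civilis", "populus", "romanus"]
culinary = ["caro", "in", "ius", "coquo", "cum", "piper"]

print(simplified_lesk("ius", legal))
print(simplified_lesk("ius", culinary))
```

With these toy inputs, the legal context selects the "law" sense and the culinary context selects "broth". The sketch also shows the approach's dependence on the sense inventory: a context that shares no lemma with any signature yields no decision at all, which is one reason later approaches draw on corpus-derived and distributional evidence instead.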