Patrick J. Burns

Associate Research Scholar, Digital Projects @ Institute for the Study of the Ancient World / NYU | Formerly Culture, Cognition, and Coevolution Lab (Harvard) & Quantitative Criticism Lab (UT-Austin) | Fordham PhD, Classics | LatinCy developer

Streamlining historical-language text processing with CLTK Readers

Demonstration of CLTK Readers at Ancient Makerspaces 2023

Abstract

Along with the mass of digitized Latin texts now available to researchers for computational analysis, there exist a number of different formats, markup strategies, encodings, and editorial decisions that can make it difficult to incorporate texts from various sources into research projects without considerable preparatory work. In this workshop, I demonstrate the use of CLTK Readers, a Python-based solution for streamlining the process of working with different collections of Ancient Greek and Latin texts. CLTK Readers consists of two primary parts: (1) a series of corpus readers that transform documents into a common format of analyzable units, such as documents, paragraphs, sentences, lines, and words, with programmatic flexibility for segmentation, tokenization, lemmatization, and other kinds of text annotation, and with built-in support for the Classical Language Toolkit; and (2) a collection of preconfigured classification, clustering, and visualization pipelines for easily setting up text-analysis experiments based on the output of the corpus readers. Corpus readers are currently available for the CLTK Tesserae Ancient Greek and Latin Corpora and the Universal Dependencies treebanks, with readers for additional collections and formats in development.
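To make the corpus-reader idea concrete, here is a minimal sketch in plain Python of a reader that exposes one collection of texts as documents, paragraphs, sentences, and words. It is an illustration only and does not reproduce the CLTK Readers API: the class name `ToyCorpusReader`, its method names, the helper functions, and the sample corpus are all assumptions made for this example, and the regex-based segmentation and tokenization are naive stand-ins for the language-aware tools a real reader would plug in.

```python
import re
from typing import Callable, Dict, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '?' or '!' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


class ToyCorpusReader:
    """Hypothetical reader for illustration; NOT the CLTK Readers API."""

    def __init__(
        self,
        texts: Dict[str, str],
        sent_splitter: Callable[[str], List[str]] = split_sentences,
        word_tokenizer: Callable[[str], List[str]] = tokenize,
    ) -> None:
        # texts maps a file identifier to its raw text; the splitter and
        # tokenizer are swappable so language-specific tools can be plugged in.
        self.texts = texts
        self.sent_splitter = sent_splitter
        self.word_tokenizer = word_tokenizer

    def docs(self) -> Iterator[str]:
        """Yield each document as one string."""
        yield from self.texts.values()

    def paras(self) -> Iterator[str]:
        """Yield paragraphs, treating blank lines as paragraph breaks."""
        for doc in self.docs():
            for para in re.split(r"\n\s*\n", doc):
                if para.strip():
                    yield para.strip()

    def sents(self) -> Iterator[str]:
        """Yield sentences, paragraph by paragraph."""
        for para in self.paras():
            yield from self.sent_splitter(para)

    def words(self) -> Iterator[str]:
        """Yield word and punctuation tokens, sentence by sentence."""
        for sent in self.sents():
            yield from self.word_tokenizer(sent)


# Sample corpus invented for this example.
corpus = {
    "caesar.txt": (
        "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.\n\n"
        "Hi omnes lingua inter se differunt."
    ),
}
reader = ToyCorpusReader(corpus)
sentences = list(reader.sents())
tokens = list(reader.words())
```

Because each unit is built on the one above it, code written against `sents()` or `words()` would not need to change when the underlying collection does; only the reader that parses the source format would. Passing a different `sent_splitter` or `word_tokenizer` is one way to model the programmatic flexibility described above.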