Patrick J. Burns

Research Associate at Harvard Human Evolutionary Biology | Formerly Quantitative Criticism Lab, ISAW Library | Fordham PhD, Classics | CLTK contributor

Streamlining historical-language text processing with CLTK Readers

Demonstration of CLTK Readers at Ancient Makerspaces 2023

Abstract

The mass of digitized Latin texts now available to researchers for computational analysis comes in a range of formats, markup strategies, and encodings, and reflects differing editorial decisions. This variety can make it difficult to incorporate texts from multiple sources into a research project without considerable preparatory work. In this workshop, I demonstrate the use of CLTK Readers, a Python-based solution for streamlining work with different collections of Ancient Greek and Latin texts. CLTK Readers consists of two primary parts: (1) a series of corpus readers that transform documents into a common format of analyzable units, such as documents, paragraphs, sentences, lines, and words, with programmatic flexibility for segmentation, tokenization, lemmatization, and other kinds of text annotation, and with built-in support for the Classical Language Toolkit; and (2) a collection of preconfigured classification, clustering, and visualization pipelines for easily setting up text-analysis experiments based on the output of the corpus readers. Corpus readers are currently available for the CLTK Tesserae Ancient Greek and Latin Corpora and the Universal Dependencies treebanks, with readers for additional collections and formats in development.
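The idea of a corpus reader that exposes a collection as a common set of analyzable units can be illustrated with a small, self-contained sketch. This is not the CLTK Readers API: the class name ToyCorpusReader, its method names, the in-memory sample texts, and the regex-based segmentation are all assumptions made for illustration, standing in for the configurable segmenters and tokenizers that the abstract describes.

```python
import re
from typing import Dict, Iterator, List, Optional


class ToyCorpusReader:
    """Hypothetical stand-in for a corpus reader (NOT the CLTK Readers API).

    Wraps a mapping of file ids to raw text and exposes the same text
    at several levels of granularity.
    """

    def __init__(self, texts: Dict[str, str]) -> None:
        self.texts = texts

    def fileids(self) -> List[str]:
        # Sorted so that iteration order is predictable.
        return sorted(self.texts)

    def docs(self, fileids: Optional[List[str]] = None) -> Iterator[str]:
        for fileid in fileids or self.fileids():
            yield self.texts[fileid]

    def paras(self, fileids: Optional[List[str]] = None) -> Iterator[str]:
        # Assumption: paragraphs are separated by blank lines.
        for doc in self.docs(fileids):
            for para in re.split(r"\n\s*\n", doc):
                if para.strip():
                    yield " ".join(para.split())

    def sents(self, fileids: Optional[List[str]] = None) -> Iterator[str]:
        # Naive splitter on sentence-final punctuation; a real reader
        # would plug in a language-aware sentence segmenter here.
        for para in self.paras(fileids):
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    yield sent

    def words(self, fileids: Optional[List[str]] = None) -> Iterator[str]:
        # Naive lowercased word tokens; no lemmatization is attempted.
        for sent in self.sents(fileids):
            yield from re.findall(r"\w+", sent.lower())


# Illustrative sample texts, held in memory rather than read from disk.
texts = {
    "caesar.txt": (
        "Gallia est omnis divisa in partes tres. "
        "Quarum unam incolunt Belgae.\n\n"
        "Hi omnes lingua inter se differunt."
    ),
    "vergil.txt": "Arma virumque cano.",
}

reader = ToyCorpusReader(texts)
for sent in reader.sents(["caesar.txt"]):
    print(sent)
```

Because every level is built on the one below it, swapping in a better sentence splitter or tokenizer changes the output of the higher-level methods without changing how the reader is called, which is the kind of flexibility the abstract attributes to the corpus readers.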
