Creating Stopword Lists for Historical Languages
Abstract for article published in Digital Classics Online 4(2)
December 6, 2018
Stoplists are lists of words that are filtered out of documents before text analysis, usually because the words are either high-frequency or low in semantic value. This paper describes the development of a generalizable method for building stoplists in the Classical Language Toolkit (CLTK), an open-source Python platform for natural language processing research on historical languages. Stoplists are not readily available for many historical languages, and those that are available often offer little documentation of their sources or method of construction. A generalizable method for building historical-language stoplists offers the following benefits: (1) better support for well-documented, data-driven, and replicable results in the use of CLTK resources; (2) less arbitrary decision-making in building stoplists; (3) greater consistency in how stopwords are extracted from documents across multiple languages; and (4) clearer guidelines and standards for CLTK developers and contributors, a helpful step forward in managing the complexity of a multi-language open-source project.
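To make the idea of a data-driven stoplist concrete, here is a minimal sketch of the simplest frequency-based approach: count tokens across a small corpus and take the most frequent ones as stopwords. This is an illustration only, not the CLTK's implementation or the method developed in the article. The function names `build_stoplist` and `remove_stopwords`, the naive regex tokenizer, and the sample Latin strings are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical helper: naive tokenizer that lowercases the text and
# keeps runs of Unicode letters. Real historical-language work would
# need language-specific tokenization.
def tokenize(text):
    return re.findall(r"[^\W\d_]+", text.lower())

# Hypothetical helper: return the `size` most frequent tokens in the corpus.
def build_stoplist(documents, size=10):
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]

# Hypothetical helper: drop every token that appears in the stoplist.
def remove_stopwords(text, stoplist):
    stops = set(stoplist)
    return [tok for tok in tokenize(text) if tok not in stops]

# Illustrative sample corpus (not taken from the article).
docs = [
    "et in terra pax et in caelo",
    "gloria in excelsis et in terra",
]

stoplist = build_stoplist(docs, size=2)
print(stoplist)                                       # ['in', 'et']
print(remove_stopwords("et in terra pax", stoplist))  # ['terra', 'pax']
```

Raw frequency is only one possible basis for selecting stopwords. Because the cutoff (`size`) is an explicit parameter rather than a hand-picked word list, the resulting stoplist can be documented and reproduced from the same corpus.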
NB: This article is a continuation of work first presented as “Creating Stopword Lists for Historical Languages” for the Global Philology: Big Corpora of Historical Text conference at Universität Leipzig in July 2017.