Patrick J. Burns

Asst. Research Scholar at ISAW Library | Fordham PhD, Classics | CLTK contributor

Creating Stopword Lists for Historical Languages

Abstract for paper delivered at the ‘Global Philology: Big Corpora of Historical Text’ seminar, Universität Leipzig.
July 10, 2017

NB: The work presented in this conference paper has since been revised and published in Digital Classics Online 4(2). Please visit this link for the published article.

Abstract

Stopwords are words that are filtered from documents prior to text analysis tasks, usually words that are either high frequency or semantically non-selective. Through the removal of such words, text analysis tasks, including supervised machine learning, clustering, information retrieval, and text summarization, benefit in areas like noise reduction, feature reduction, or speed optimization. There are stopword lists available online for Latin and Greek, for example via the Perseus Project and stopwords-json, but these lists offer little documentation about their sources or creation. Moreover, since the Perseus list is not published as a self-contained dataset, it does not provide systematic version control or persistent identifiers for proper citation.
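As a minimal illustration of the filtering step described above, the following Python sketch removes stopwords from a tokenized Latin sentence. The six-word list and the function name `remove_stopwords` are hypothetical and chosen only for this example; they are not taken from the Perseus or stopwords-json lists.

```python
# Hypothetical mini stopword list for illustration only;
# not the Perseus or stopwords-json list.
LATIN_STOPS = {"et", "in", "est", "non", "ad", "cum"}


def remove_stopwords(tokens, stops):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stops]


# Naive whitespace tokenization, sufficient for this toy sentence.
tokens = "Gallia est omnis divisa in partes tres".split()
filtered = remove_stopwords(tokens, LATIN_STOPS)
print(filtered)  # ['Gallia', 'omnis', 'divisa', 'partes', 'tres']
```

Even in this small case the feature set shrinks from seven tokens to five, which is the kind of noise and feature reduction mentioned above.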

I propose to use the time at the Global Philology Big Corpora of Historical Text seminar at Universität Leipzig on July 10-11 for developing best practices for stopword list creation for Latin, Greek, and other historical languages. I will review recent publications on stopword list creation in other languages, replicate corpus-based experiments based on this literature, and work with seminar participants to arrive at the appropriate methods for developing similar lists in our target languages.
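One simple corpus-based approach of the kind such experiments might start from is to rank tokens by raw frequency and take the top of the list as stopword candidates. The sketch below assumes whitespace tokenization and a toy three-document corpus. The function name `candidate_stopwords`, the tie-breaking by document frequency, and the cutoff `size` are illustrative assumptions, not the method of any particular publication or of this project.

```python
from collections import Counter


def candidate_stopwords(documents, size=5):
    """Rank tokens by total frequency and return the top `size` candidates.

    Ties are broken by document frequency (descending), then alphabetically.
    """
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the token
    for doc in documents:
        tokens = doc.lower().split()  # naive tokenization (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    ranked = sorted(term_freq, key=lambda t: (-term_freq[t], -doc_freq[t], t))
    return ranked[:size]


# Toy corpus for illustration only.
docs = [
    "arma virumque cano et troiae qui primus ab oris",
    "et in arcadia ego et in silvis",
    "in principio erat verbum et verbum erat apud deum",
]
stops = candidate_stopwords(docs, size=2)
print(stops)  # ['et', 'in']
```

A frequency cutoff alone can promote content words that happen to be common in a small or topically narrow corpus, which is one reason to compare several ranking measures before fixing a list.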

The outcome of the seminar would be an article presenting results of stopword list creation methods for Latin as well as a white paper suggesting the best path forward for other languages represented in the Global Philology project. These results, i.e. the stopword list itself, would also be “published” in 1. the Classical Language Toolkit, 2. a branch of the natural language processing tool spaCy that I am developing for the Latin language, and 3. other open-source venues such as stopwords-json and stop-words. Furthermore, the stopword dataset would be released under CC0 for use in Open Greek and Latin and related projects. By ensuring that the Latin stopword list is properly documented, version controlled, and citable, I aim for the RAD paradigm with this dataset: it will be replicable, aggregable, and data-driven, and as such better suited for inclusion in other text analysis projects.

