Lemmatization is the task of mapping a token in a text to its dictionary headword. The default LatinCy pipelines use two components for this task: (1) spaCy’s Edit Tree Lemmatizer, named “trainable_lemmatizer” in the pipeline, and (2) a custom Lookup Lemmatizer, named “lookup_lemmatizer”. In the first lemmatization pass, a probabilistic tree model predicts the transformation from the token form to its lemma. A second pass, made at the end of the pipeline, checks the token form against a large (~1M-item) lemma dictionary (i.e. lookups) of ostensibly unambiguous forms; if a match is found, the lemma is overwritten with the corresponding value from the lookups. The two-pass logic largely follows the approach recommended in Burns (2018) and Burns (2020), as facilitated by the spaCy pipeline architecture.
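The two-pass logic can be sketched in plain Python. This is a toy illustration under stated assumptions, not LatinCy's implementation: `first_pass` is a hypothetical stand-in for the trained model (it only applies a few hard-coded suffix rules), and `LOOKUPS` is a three-entry stand-in for the real lemma dictionary.

```python
# Toy stand-in for the lookup table of ostensibly unambiguous forms
# (hypothetical three-entry subset; the real table is far larger).
LOOKUPS = {"Iovis": "Iuppiter", "poetis": "poeta", "deorum": "deus"}

# Hypothetical (suffix, replacement) rules standing in for the
# transformations a trained model would predict.
SUFFIX_RULES = [("antur", "o"), ("orum", "us"), ("is", "us")]


def first_pass(form):
    """Guess a lemma by rewriting the first matching suffix; else keep the form."""
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix):
            return form[: -len(suffix)] + replacement
    return form


def second_pass(form, lemma):
    """Overwrite the first-pass lemma only if the form is in the lookup table."""
    return LOOKUPS.get(form, lemma)


def lemmatize(form):
    """Run both passes in order: prediction first, lookup override second."""
    return second_pass(form, first_pass(form))


for form in ["narrantur", "poetis", "Iovis", "Perseus"]:
    print(form, first_pass(form), lemmatize(form))
```

The point of the design is that the second pass only ever overrides: forms absent from the lookups keep whatever the first pass predicted, so the dictionary corrects the model without replacing it.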
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_sm')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Note here the two lemmatizer components that are included in the pipeline, i.e. “trainable_lemmatizer” and “lookup_lemmatizer”…
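A minimal sketch of how the component names might be inspected. `nlp.pipe_names` is spaCy's list of component names in pipeline order; the helper `lemmatizer_positions` and the stand-in list are hypothetical, and only the two lemmatizer names in that list are taken from the text.

```python
def lemmatizer_positions(pipe_names):
    """Map each lemmatizer component name to its index in the pipeline order."""
    return {name: i for i, name in enumerate(pipe_names) if "lemmatizer" in name}


# Intended use, assuming spaCy and the model are installed:
#   import spacy
#   nlp = spacy.load("la_core_web_sm")
#   print(lemmatizer_positions(nlp.pipe_names))

# Stand-in list so the sketch runs without the model. Only the two lemmatizer
# names come from the text; the other names and the ordering are placeholders.
example_pipe_names = ["tagger", "trainable_lemmatizer", "parser", "lookup_lemmatizer"]
print(lemmatizer_positions(example_pipe_names))
```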
Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, lemmas can be found as annotations of the Token objects…
import tabulate

# Collect a [text, lemma] row for each token
data = []
tokens = [item for item in doc]
for token in tokens:
    data.append([token.text, token.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Text          Lemma
------------  --------
Haec          hic
narrantur     narro
a             ab
poetis        poeta
de            de
Perseo        Perseo
.             .
Perseus       Perseus
filius        filius
erat          sum
Iovis         Iuppiter
,             ,
maximi
deorum        deus
.             .
Avus          auus
eius          is
Acrisius      Acrisius
appellabatur  appello
.             .
The lemma_ attribute has the type str and so is compatible with all string operations…
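As a small illustration, the lemma strings from the table above can be handled like any other Python strings. The list below is copied by hand from that table so the sketch runs without the model; with a Doc in hand, the same operations would apply to each `token.lemma_`.

```python
# Lemmas copied from the table above; punctuation and the one empty lemma are left out.
lemmas = [
    "hic", "narro", "ab", "poeta", "de", "Perseo", "Perseus", "filius",
    "sum", "Iuppiter", "deus", "auus", "is", "Acrisius", "appello",
]

# Any str method applies, e.g. case tests, case conversion, and sorting.
capitalized = [lemma for lemma in lemmas if lemma[0].isupper()]
uppercased = [lemma.upper() for lemma in lemmas]
alphabetical = sorted(lemmas, key=str.lower)

print(capitalized)
print(alphabetical)
```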
The lemma_ attribute, though, is only the human-readable version of the lemma. Internally, spaCy uses a hash value to represent the lemma, which is stored in the lemma attribute. (Note the lack of trailing underscore.)
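The relationship between the two attributes can be sketched as follows. The helper names `lemma_ids` and `lemma_roundtrips` are hypothetical; the sketch assumes only that `token.lemma` is the integer hash, `token.lemma_` is its string, and `doc.vocab.strings` is the spaCy StringStore that maps between them.

```python
def lemma_ids(doc):
    """Return (text, lemma string, lemma hash) triples for every token in a Doc."""
    return [(token.text, token.lemma_, token.lemma) for token in doc]


def lemma_roundtrips(doc):
    """Check that every lemma hash resolves back to its string via the StringStore."""
    strings = doc.vocab.strings
    return all(strings[token.lemma] == token.lemma_ for token in doc)


# Intended use, assuming `doc` was created by the LatinCy pipeline as above:
#   for text, lemma_str, lemma_hash in lemma_ids(doc):
#       print(text, lemma_str, lemma_hash)
#   print(lemma_roundtrips(doc))
```

Comparing hashes rather than strings is mostly an internal efficiency concern; for everyday work the string form in `lemma_` is usually the more convenient one.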
In order to compare the two different lemmatization passes, we can create two copies of the LatinCy pipeline, each with one of the two lemmatizers removed…
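One way such a comparison might be set up is sketched here. It assumes that `spacy.load` with its `exclude` argument is an acceptable way to drop a component (removing the component after loading would be an alternative) and that the remaining components still run without the excluded one; neither assumption is confirmed by the text. The function names are hypothetical.

```python
def load_single_lemmatizer_pipelines(model="la_core_web_sm"):
    """Load two copies of the pipeline, each with one of the two lemmatizers excluded."""
    import spacy  # imported lazily so compare_lemmas can be used without spaCy

    nlp_trainable = spacy.load(model, exclude=["lookup_lemmatizer"])
    nlp_lookup = spacy.load(model, exclude=["trainable_lemmatizer"])
    return nlp_trainable, nlp_lookup


def compare_lemmas(text, nlp_trainable, nlp_lookup):
    """Return (text, trainable lemma, lookup lemma) rows, one per token.

    Assumes both pipelines tokenize the text identically.
    """
    doc_trainable = nlp_trainable(text)
    doc_lookup = nlp_lookup(text)
    return [
        (a.text, a.lemma_, b.lemma_)
        for a, b in zip(doc_trainable, doc_lookup)
    ]


# Intended use, assuming spaCy and the model are installed:
#   nlp_trainable, nlp_lookup = load_single_lemmatizer_pipelines()
#   rows = compare_lemmas(text, nlp_trainable, nlp_lookup)
```

The rows can then be passed to `tabulate.tabulate` with three headers to print a side-by-side table.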
Text Trainable lemmatizer Lookup lemmatizer
------------ ---------------------- -------------------
Haec hic
narrantur narro narro
a ab
poetis poetus poeta
de de
Perseo Perseo
. . .
Perseus Perseus
filius filius filius
erat sum sum
Iovis Iovis Iuppiter
, , ,
maximi
deorum deus deus
. . .
Avus avus auus
eius is is
Acrisius Acrisius
appellabatur appello appello
. . .
Note specifically the lemmatization of Iovis. Since it is a highly irregular form, it is not surprising that the Edit Tree Lemmatizer has manufactured a plausible, but incorrect, lemma based on the root Iov-. Since Iovis is an unambiguous Latin form and has been added to the LatinCy lookups, the Lookup Lemmatizer steps in to correct the erroneous first pass. The lookup data can be found in a custom version of the spaCy spacy-lookups-data package [https://github.com/diyclassics/spacy-lookups-data/tree/master/spacy_lookups_data/data]. These lookups are installed as a dependency of each of the LatinCy pipelines.
The code below shows how to access the lookup data directly…
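A hedged sketch of one way to reach the lookup table follows. It uses spaCy's `load_lookups` function and the `get_table` method of the returned Lookups object; the language code "la" and the table name "lemma_lookup" are assumptions about how the LatinCy lookups package registers its data, and both function names are hypothetical.

```python
def load_latin_lemma_table():
    """Load the Latin lemma lookup table through spaCy's lookups registry.

    Assumes the lookups package is installed and registers a table named
    "lemma_lookup" under the language code "la".
    """
    from spacy.lookups import load_lookups  # lazy import: needs spaCy installed

    lookups = load_lookups(lang="la", tables=["lemma_lookup"])
    return lookups.get_table("lemma_lookup")


def lookup_lemma(table, form, default=None):
    """Return the lemma stored for an exact form, or a default if it is absent."""
    return table.get(form, default)


# Intended use, assuming the packages above are installed:
#   table = load_latin_lemma_table()
#   print(lookup_lemma(table, "Iovis"))
```

Because `lookup_lemma` only relies on a dict-style `get`, it can be tried out on an ordinary dictionary before the real table is loaded.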
NLTK, Chapter 3, Section 3.6, “Normalizing text”
SLP, Chapter 2, Section 2.6, “Word normalization, lemmatization and stemming”, pp. 23-24
spaCy EditTreeLemmatizer
Burns, Patrick J. 2018. “Backoff Lemmatization as a Philological Method.” In Digital Humanities 2018, DH 2018, Book of Abstracts, El Colegio de México, UNAM, and RedHD, Mexico City, Mexico, June 26-29, 2018.
Burns, Patrick J. 2020. “Ensemble Lemmatization with the Classical Language Toolkit.” Studi e Saggi Linguistici 58 (1): 157–76. https://doi.org/10.4454/ssl.v58i1.273.