Lemmatization is the task of mapping a token in a text to its dictionary headword. The default LatinCy pipelines use two components to perform this task: 1. spaCy’s Edit Tree Lemmatizer and 2. a custom Lookup Lemmatizer, named in the pipeline “trainable_lemmatizer” and “lookup_lemmatizer” respectively. In the first lemmatization pass, a probabilistic tree model predicts the transformation from the token form to its lemma. A second pass is made at the end of the pipeline, checking the token form against a large (~1M-item) lemma dictionary (i.e. lookups) of ostensibly unambiguous forms; if a match is found, the lemma is overwritten with the corresponding value from the lookup table. This two-pass logic largely follows the approach recommended in Burns (2018) and Burns (2020), as facilitated by the spaCy pipeline architecture.
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_sm')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Note here the two lemmatizer components that are included in the pipeline, i.e. “trainable_lemmatizer” and “lookup_lemmatizer”…
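We can confirm this by listing the component names directly; a quick check (the surrounding components are elided here, since the full list may vary with the pipeline version):

print(nlp.pipe_names)
# e.g. [..., 'trainable_lemmatizer', ..., 'lookup_lemmatizer']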
Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, lemmas can be found as annotations of the Token objects…
import tabulate

data = []
for token in doc:
    data.append([token.text, token.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Text          Lemma
------------  --------
Haec          hic
narrantur     narro
a             ab
poetis        poeta
de            de
Perseo        Perseus
.             .
Perseus       perseus
filius        filius
erat          sum
Iovis         Iuppiter
,             ,
maximi        magnus
deorum        deus
.             .
Avus          auus
eius          is
Acrisius      Acrisius
appellabatur  appello
.             .
The lemma_ attribute has the type str and so is compatible with all string operations…
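For example, a small illustration collecting the unique lemmas from the Doc above using ordinary Python string and collection operations:

# lemma_ is an ordinary str, so standard string operations apply
lemmas = [token.lemma_ for token in doc if token.is_alpha]
print(sorted(set(lemmas)))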
The lemma_ attribute, though, is only the human-readable version of the lemma. Internally, spaCy uses a hash value to represent the lemma, which is stored in the lemma attribute. (Note the lack of trailing underscore.)
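A quick look at both attributes for a single token (doc[10], i.e. “Iovis”, given the tokenization above); the integer hash can be resolved back to its string through the pipeline’s StringStore:

token = doc[10]                        # "Iovis"
print(token.lemma_)                    # 'Iuppiter' -- the human-readable str
print(token.lemma)                     # integer hash representing the lemma
print(nlp.vocab.strings[token.lemma])  # hash resolved back to 'Iuppiter'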
In order to compare the two different lemmatization passes, we can create two copies of the LatinCy pipeline, each with one of the two lemmatizers removed…
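Here is a minimal sketch of that comparison, using the exclude argument of spacy.load to drop one lemmatizer from each copy (component names as listed above):

# Load two copies of the pipeline, each excluding one of the lemmatizers
nlp_trainable = spacy.load('la_core_web_sm', exclude=['lookup_lemmatizer'])
nlp_lookup = spacy.load('la_core_web_sm', exclude=['trainable_lemmatizer'])

doc_trainable = nlp_trainable(text)
doc_lookup = nlp_lookup(text)

data = []
for trainable, lookup in zip(doc_trainable, doc_lookup):
    data.append([trainable.text, trainable.lemma_, lookup.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Trainable lemmatizer', 'Lookup lemmatizer']))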
Text          Trainable lemmatizer    Lookup lemmatizer
------------  ----------------------  -------------------
Haec          hic
narrantur     narro                   narro
a             ab
poetis        poeta                   poeta
de            de
Perseo        Perseus
.             .                       .
Perseus       perseus
filius        filius                  filius
erat          sum                     sum
Iovis         Iovis                   Iuppiter
,             ,                       ,
maximi        magnus
deorum        deus                    deus
.             .                       .
Avus          avus                    auus
eius          is                      is
Acrisius      Acrisius
appellabatur  appello                 appello
.             .                       .
Even if the lookup lemmatizer overwrites the lemma predicted by the trainable lemmatizer, the original prediction from the trainable lemmatizer is still preserved in a custom Token extension attribute (accessed through the Token._ namespace).
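A sketch of retrieving it for the token “Iovis” below; note that the extension name predicted_lemma is an assumption here, so check the LatinCy documentation for the attribute actually registered:

# NB: the extension name `predicted_lemma` is assumed for illustration;
# consult the LatinCy docs for the attribute actually registered.
token = doc[10]  # "Iovis", given the tokenization above
print(f"Token: {token.text}")
print(f"Lemma: {token.lemma_}")
print(f"Predicted lemma: {token._.predicted_lemma}")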
Token: Iovis
Lemma: Iuppiter
Predicted lemma: Iovis
Note specifically the lemmatization of Iovis: since it is a highly irregular form, it is not surprising that the Edit Tree Lemmatizer has manufactured a plausible, but incorrect, lemma based on the root Iov-. Since Iovis is an unambiguous Latin form and has been added to the LatinCy lookups, the Lookup Lemmatizer steps in to correct the erroneous first pass. The lookup data can be found in a customized fork of the spaCy spacy-lookups-data package [https://github.com/diyclassics/spacy-lookups-data/tree/master/spacy_lookups_data/data]. These lookups are installed as a dependency of each of the LatinCy pipelines.
The code below shows how to access the lookup data directly…
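This sketch uses spaCy’s load_lookups helper; the table name “lemma_lookup” is an assumption based on spaCy’s standard lookup-table naming, so inspect the installed package data to confirm it. We keep a reference to the table as lookups, since the custom component below checks new forms against it.

from spacy.lookups import load_lookups

# NB: "lemma_lookup" is an assumed table name; check the
# spacy-lookups-data fork for the tables it actually registers.
lookups = load_lookups(lang="la", tables=["lemma_lookup"]).get_table("lemma_lookup")

print(len(lookups))           # on the order of ~1M entries
print(lookups.get("Iovis"))   # expected: 'Iuppiter'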
What is described above is the default role of the lemmatizers in the LatinCy pipeline. Of course, it may be the case that you want to add additional lookups to your annotation pipeline, overriding the default lookups. Here is an example of such a setup using a custom pipeline component registered with Language.component. We will cover custom components and factories in more detail in a later notebook.
First, let’s lemmatize a sentence with some unexpected, non-Classical Latin words using the default settings…
# cf. https://la.wikipedia.org/wiki/Star_Wars#Episodium_VI:_Iedi_reduces
doc = nlp("In planeta Tatooine, sunt Principissa Leia et Lucas Skywalker noviter Jedi factus")

data = []
for token in doc:
    data.append([token.text, token.lemma_])

print("Before...", end="\n\n")
print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Before...
Text         Lemma
-----------  -----------
In           in
planeta      planeta
Tatooine     Tatooina
,            ,
sunt         sum
Principissa  principissa
Leia         Leia
et           et
Lucas        Lucas
Skywalker    Skywalker
noviter      noviter
Jedi         Jedus
factus       facio
Now we will lemmatize again but with a custom lookup table (i.e. dict) added to the pipeline…
from spacy import Language

custom_lemma_lookups = {
    "Jedi": ["Jedi"],
    "Tatooine": ["Tatooine"],
    "Leia": ["Leia", "Leiae", "Leiam"],
}

# Invert the lemma -> forms mapping into a form -> lemma dict, skipping
# any form already covered by the default lookup table
custom_lookups = {}
for key, values in custom_lemma_lookups.items():
    for value in values:
        if value not in lookups:
            custom_lookups[value] = key

@Language.component(name="custom_lookup_lemmatizer")
def make_lookup_lemmatizer_function(doc):
    for token in doc:
        token.lemma_ = custom_lookups.get(token.text, token.lemma_)
    return doc

try:
    nlp.add_pipe("custom_lookup_lemmatizer", name="custom_lookup_lemmatizer")
except ValueError:
    # If the pipeline component is already added, we can't add it again
    pass

# cf. https://la.wikipedia.org/wiki/Star_Wars#Episodium_VI:_Iedi_reduces
doc = nlp("In planeta Tatooine, sunt Principissa Leia et Lucas Skywalker noviter Jedi factus")

data = []
for token in doc:
    data.append([token.text, token.lemma_])

print("After...", end="\n\n")
print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
After...
Text         Lemma
-----------  -----------
In           in
planeta      planeta
Tatooine     Tatooine
,            ,
sunt         sum
Principissa  principissa
Leia         Leia
et           et
Lucas        Lucas
Skywalker    Skywalker
noviter      noviter
Jedi         Jedi
factus       facio
Note in particular that the default lemmatizers already handled “Leia” without issue (it is close to an expected Latin noun form ending in -a), while “Tatooine” and “Jedi” are now handled correctly because of our custom lookup table.
References
NLTK, Chapter 3, Section 3.6, “Normalizing text” link
SLP, Chapter 2, Section 2.6, “Word normalization, lemmatization and stemming”, pp. 23-24 link
spaCy, EditTreeLemmatizer link
Burns, Patrick J. 2018. “Backoff Lemmatization as a Philological Method.” In Digital Humanities 2018, DH 2018, Book of Abstracts, El Colegio de México, UNAM, and RedHD, Mexico City, Mexico, June 26-29, 2018.
———. 2020. “Ensemble Lemmatization with the Classical Language Toolkit.” Studi e Saggi Linguistici 58 (1): 157–76. https://doi.org/10.4454/ssl.v58i1.273.