6  Lemmatization

6.1 Lemmatization with LatinCy

Lemmatization is the task of mapping a token in a text to its dictionary headword. The default LatinCy pipelines use two components to perform this task: 1. spaCy’s Edit Tree Lemmatizer and 2. a custom Lookup Lemmatizer, named “trainable_lemmatizer” and “lookup_lemmatizer” in the pipeline respectively. In the first lemmatization pass, a probabilistic edit-tree model predicts the transformation from the token form to its lemma. A second pass is made at the end of the pipeline that checks the token form against a large (~1M-item) lemma dictionary (i.e. lookups) of ostensibly unambiguous forms; if a match is found, the lemma is overwritten with the corresponding value from the lookup table. The two-pass logic largely follows the approach recommended in Burns (2018) and Burns (2020), as facilitated by the spaCy pipeline architecture.
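
To make the two-pass logic concrete, here is a minimal, purely illustrative sketch in plain Python (not LatinCy’s actual implementation): a first pass guesses a lemma for every token, and a second pass overwrites that guess only when the form appears in a table of unambiguous forms.

# Illustrative sketch of the two-pass idea; the toy rule and table below are
# placeholders, not LatinCy's actual model or lookup data

def first_pass(form):
    # Stand-in for the trainable edit-tree lemmatizer: always produces a guess
    return form[:-1] if form.endswith("m") else form    # toy rule: 'poetam' -> 'poeta'

UNAMBIGUOUS = {"Iovis": "Iuppiter"}                      # stand-in for the ~1M-item table

def lemmatize(forms):
    lemmas = [first_pass(form) for form in forms]        # pass 1: predict a lemma
    return [UNAMBIGUOUS.get(form, lemma)                 # pass 2: overwrite on a match
            for form, lemma in zip(forms, lemmas)]

print(lemmatize(["poetam", "Iovis"]))                    # ['poeta', 'Iuppiter']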

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_sm')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Note here the two lemmatizer components that are included in the pipeline, i.e. “trainable_lemmatizer” and “lookup_lemmatizer”…

print(nlp.pipe_names)
['senter', 'normer', 'tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'lookup_lemmatizer', 'ner', 'remorpher']
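
For a fuller picture of the pipeline, spaCy’s standard analyze_pipes method reports which attributes each component declares that it sets (custom components appear here only with whatever metadata they register):

# Print a summary of each component and the attributes it declares that it assigns
analysis = nlp.analyze_pipes(pretty=True)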

Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, lemmas can be found as annotations of the Token objects…

sample_token = doc[0]

print(f'Sample token: {sample_token.text}')
print(f'Sample lemma: {sample_token.lemma_}')
Sample token: Haec
Sample lemma: hic
import tabulate

data = []

tokens = list(doc)

for token in tokens:
    data.append([token.text, token.lemma_])    

print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Text          Lemma
------------  --------
Haec          hic
narrantur     narro
a             ab
poetis        poeta
de            de
Perseo        Perseus
.             .
Perseus       perseus
filius        filius
erat          sum
Iovis         Iuppiter
,             ,
maximi        magnus
deorum        deus
.             .
Avus          auus
eius          is
Acrisius      Acrisius
appellabatur  appello
.             .

The lemma_ attribute has the type str and so is compatible with all string operations…

print(f'Token: {tokens[0].text}')
print(f'Lemma: {tokens[0].lemma_}')
print(f'Lowercase lemma: {tokens[0].lemma_.lower()}')
Token: Haec
Lemma: hic
Lowercase lemma: hic
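
Because lemmas are plain strings, they slot directly into ordinary Python idioms. For example, a quick lemma frequency count over the Doc (skipping punctuation) might look like this:

from collections import Counter

# Count lemma frequencies, ignoring punctuation tokens
lemma_counts = Counter(token.lemma_ for token in doc if not token.is_punct)
print(lemma_counts.most_common(5))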

The lemma_ attribute, though, is only the human-readable version of the lemma. Internally, spaCy uses a hash value to represent the lemma, which is stored in the lemma attribute. (Note the lack of trailing underscore.)

print(f'Token: {tokens[1].text}')
print(f'Human-readable lemma: {tokens[1].lemma_}')
print(f'spaCy lemma key: {tokens[1].lemma}')
Token: narrantur
Human-readable lemma: narro
spaCy lemma key: 11361982710182407617
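
The two representations are linked through the pipeline’s shared StringStore, so the hash can always be resolved back to its string (and vice versa):

# Resolve the lemma hash back to its string via the shared StringStore
print(nlp.vocab.strings[tokens[1].lemma])    # 'narro'

# And look up the hash for a given string
print(nlp.vocab.strings['narro'])            # same integer as tokens[1].lemma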

In order to compare the two different lemmatization passes, we can create two copies of the LatinCy pipeline, each with one of the two lemmatizers removed…

import copy

P1 = copy.deepcopy(nlp)
P1.disable_pipes(['tagger', 'morphologizer', 'lookup_lemmatizer'])
print(f'First pipeline components: {P1.pipe_names}')

P2 = copy.deepcopy(nlp)
P2.disable_pipes(['tagger', 'morphologizer', 'trainable_lemmatizer'])
print(f'Second pipeline components: {P2.pipe_names}')
First pipeline components: ['senter', 'normer', 'tok2vec', 'trainable_lemmatizer', 'parser', 'ner', 'remorpher']
Second pipeline components: ['senter', 'normer', 'tok2vec', 'parser', 'lookup_lemmatizer', 'ner', 'remorpher']
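
As an alternative to deep-copying the loaded pipeline, the same effect can be achieved by loading two fresh copies with the unwanted components disabled at load time (a standard spacy.load option; the components stay in the package but are skipped when the pipeline runs):

# Alternative: disable components at load time instead of deep-copying
P1_alt = spacy.load('la_core_web_sm', disable=['tagger', 'morphologizer', 'lookup_lemmatizer'])
P2_alt = spacy.load('la_core_web_sm', disable=['tagger', 'morphologizer', 'trainable_lemmatizer'])

print(P1_alt.pipe_names)
print(P2_alt.pipe_names)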

We can then run the same text through both pipelines and compare the results side-by-side…

P1_annotations = P1(text)
P2_annotations = P2(text)

data = []

for p1_token, p2_token in zip(P1_annotations, P2_annotations):
    data.append([p1_token.text, p1_token.lemma_, p2_token.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Trainable lemmatizer', 'Lookup lemmatizer']))
Text          Trainable lemmatizer    Lookup lemmatizer
------------  ----------------------  -------------------
Haec          hic
narrantur     narro                   narro
a             ab
poetis        poeta                   poeta
de            de
Perseo        Perseus
.             .                       .
Perseus       perseus
filius        filius                  filius
erat          sum                     sum
Iovis         Iovis                   Iuppiter
,             ,                       ,
maximi        magnus
deorum        deus                    deus
.             .                       .
Avus          avus                    auus
eius          is                      is
Acrisius      Acrisius
appellabatur  appello                 appello
.             .                       .

Even if the lookup lemmatizer overwrites the lemma predicted by the trainable lemmatizer, the original lemma predicted by the trainable lemmatizer is still stored in a custom Token attribute, i.e. Token._.predicted_lemma…

print(f'Token: {tokens[10].text}')
print(f'Lemma: {tokens[10].lemma_}')
print(f'Predicted lemma: {tokens[10]._.predicted_lemma}')
Token: Iovis
Lemma: Iuppiter
Predicted lemma: Iovis
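
This makes it easy to audit the second pass, e.g. by listing the tokens in a Doc whose final lemma differs from the edit-tree prediction (the guard below skips any token for which the custom attribute has not been set):

# List tokens where the final lemma differs from the edit-tree prediction,
# i.e. where the lookup lemmatizer stepped in
for token in doc:
    predicted = token._.predicted_lemma
    if predicted is not None and predicted != token.lemma_:
        print(f'{token.text}: {predicted} -> {token.lemma_}')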

Note specifically the lemmatization of Iovis: since it is a highly irregular form, it is not surprising that the Edit Tree Lemmatizer has manufactured a plausible, but incorrect, lemma based on the root Iov-. Since Iovis is an unambiguous Latin form and has been added to the LatinCy lookups, the Lookup Lemmatizer steps in to correct the erroneous first pass. The lookup data can be found in a custom fork of the spaCy spacy-lookups-data package [https://github.com/diyclassics/spacy-lookups-data/tree/master/spacy_lookups_data/data]. These lookups are installed as a dependency of each of the LatinCy pipelines.

The code below shows how to access the lookup data directly…

# Load the lookups data

from spacy.lookups import load_lookups

blank_nlp = spacy.blank("la")

# Load the default Latin lemma lookup table shipped with spacy-lookups-data
lookups_data = load_lookups(lang=blank_nlp.vocab.lang, tables=["lemma_lookup"])
LOOKUPS = lookups_data.get_table("lemma_lookup")

print(LOOKUPS['Iovis'])
Iuppiter
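
The table behaves like an ordinary mapping, so its size can be checked and individual forms tested or retrieved directly:

# The lookup table supports standard mapping operations
print(len(LOOKUPS))              # number of form -> lemma entries
print('Iovis' in LOOKUPS)        # True
print(LOOKUPS.get('Iovis'))      # 'Iuppiter'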

What is described above is the default role of the lemmatizers in the LatinCy pipeline. Of course, it may be the case that you want to add additional lookups to your annotation pipeline, overriding the default lookups. Here is an example of such a setup using a custom pipeline component registered with Language.component. We will cover custom components and factories in more detail in a later notebook.

First, let’s lemmatize a sentence with some unexpected, non-Classical Latin words using the default settings…

# cf. https://la.wikipedia.org/wiki/Star_Wars#Episodium_VI:_Iedi_reduces
doc = nlp("In planeta Tatooine, sunt Principissa Leia et Lucas Skywalker noviter Jedi factus")

data = []

for token in doc:
    data.append([token.text, token.lemma_])

print("Before...", end="\n\n")
print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Before...

Text         Lemma
-----------  -----------
In           in
planeta      planeta
Tatooine     Tatooina
,            ,
sunt         sum
Principissa  principissa
Leia         Leia
et           et
Lucas        Lucas
Skywalker    Skywalker
noviter      noviter
Jedi         Jedus
factus       facio

Now we will lemmatize again but with a custom lookup table (i.e. dict) added to the pipeline…

from spacy import Language

# Map each custom lemma to the surface forms that should resolve to it
custom_lemma_lookups = {
    "Jedi": ["Jedi"],
    "Tatooine": ["Tatooine"],
    "Leia": ["Leia", "Leiae", "Leiam"],
}

# Invert lemma -> forms into a form -> lemma mapping, skipping any form
# already covered by the default lookup table
custom_lookups = {}
for lemma, forms in custom_lemma_lookups.items():
    for form in forms:
        if form not in LOOKUPS:
            custom_lookups[form] = lemma

@Language.component(name="custom_lookup_lemmatizer")
def custom_lookup_lemmatizer(doc):
    # Overwrite the lemma only when the token form is in the custom table
    for token in doc:
        token.lemma_ = custom_lookups.get(token.text, token.lemma_)
    return doc

try:
    nlp.add_pipe("custom_lookup_lemmatizer")
except ValueError:
    # The component has already been added to the pipeline
    pass

# cf. https://la.wikipedia.org/wiki/Star_Wars#Episodium_VI:_Iedi_reduces
doc = nlp("In planeta Tatooine, sunt Principissa Leia et Lucas Skywalker noviter Jedi factus")

data = []

for token in doc:
    data.append([token.text, token.lemma_])

print("After...", end="\n\n")
print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
After...

Text         Lemma
-----------  -----------
In           in
planeta      planeta
Tatooine     Tatooine
,            ,
sunt         sum
Principissa  principissa
Leia         Leia
et           et
Lucas        Lucas
Skywalker    Skywalker
noviter      noviter
Jedi         Jedi
factus       facio

Note in particular how the default lemmatizers handled “Leia” without issue (since it is close to an expected Latin noun form ending in -a), while the custom lookup table now correctly handles “Tatooine” and “Jedi”.
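
To restore the default behaviour afterwards, the custom component can simply be removed from the pipeline again:

# Remove the custom component to return to the default lemmatization behaviour
if "custom_lookup_lemmatizer" in nlp.pipe_names:
    nlp.remove_pipe("custom_lookup_lemmatizer")

print(nlp.pipe_names)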

References

NLTK Chapter 3, Section 3.6 “Normalizing text” link
SLP Chapter 2, Section 2.6 “Word normalization, lemmatization and stemming”, pp. 23-24 link
spaCy EditTreeLemmatizer link