6  Lemmatization

6.1 Lemmatization with LatinCy

Lemmatization is the task of mapping a token in a text to its dictionary headword. The default LatinCy pipelines use two components to perform this task: (1) spaCy’s Edit Tree Lemmatizer and (2) a custom Lookup Lemmatizer, named “trainable_lemmatizer” and “lookup_lemmatizer” in the pipeline, respectively. In the first lemmatization pass, a probabilistic edit-tree model predicts the transformation from the token form to its lemma. A second pass is made at the end of the pipeline, checking the token form against a large (~1M item) lemma dictionary (i.e. lookups) of ostensibly unambiguous forms; if a match is found, the lemma is overwritten with the corresponding value from the lookup table. This two-pass logic largely follows the approach recommended in Burns (2018) and Burns (2020), as facilitated by the spaCy pipeline architecture.
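The effect of the second pass can be sketched in a few lines of plain Python. Note that the lookup_table dict and second_pass function below are illustrative stand-ins for the override logic, not the pipeline’s actual internals…

# A minimal sketch of the two-pass override logic; the toy dict below
# stands in for LatinCy's ~1M-item lookup table

lookup_table = {'Iovis': 'Iuppiter'}

def second_pass(form, predicted_lemma):
    # Keep the first-pass prediction unless the form has an
    # unambiguous entry in the lookup table
    return lookup_table.get(form, predicted_lemma)

print(second_pass('Iovis', 'Iovis'))
print(second_pass('narrantur', 'narro'))
Iuppiter
narro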

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_sm')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Note here the two lemmatizer components that are included in the pipeline, i.e. “trainable_lemmatizer” and “lookup_lemmatizer”…

print(nlp.pipe_names)
['senter', 'normer', 'tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'lookup_lemmatizer', 'ner']
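Note, too, the position of the two components: “trainable_lemmatizer” runs mid-pipeline, while “lookup_lemmatizer” runs near the end, which is what allows the lookup pass to overwrite the trainable pass…

print(nlp.pipe_names.index('trainable_lemmatizer'))
print(nlp.pipe_names.index('lookup_lemmatizer'))
5
7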

Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, lemmas can be found as annotations of the Token objects…

sample_token = doc[0]

print(f'Sample token: {sample_token.text}')
print(f'Sample lemma: {sample_token.lemma_}')
Sample token: Haec
Sample lemma: hic
import tabulate

data = []

tokens = [token for token in doc]

# Collect (text, lemma) pairs for tabular display
for token in tokens:
    data.append([token.text, token.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Lemma']))
Text          Lemma
------------  --------
Haec          hic
narrantur     narro
a             ab
poetis        poeta
de            de
Perseo        Perseo
.             .
Perseus       Perseus
filius        filius
erat          sum
Iovis         Iuppiter
,             ,
maximi
deorum        deus
.             .
Avus          auus
eius          is
Acrisius      Acrisius
appellabatur  appello
.             .

The lemma_ attribute has the type str and so is compatible with all string operations…

print(f'Token: {tokens[0].text}')
print(f'Lemma: {tokens[0].lemma_}')
print(f'Lowercase lemma: {tokens[0].lemma_.lower()}')
Token: Haec
Lemma: hic
Lowercase lemma: hic

The lemma_ attribute, though, is only the human-readable version of the lemma. Internally, spaCy uses a hash value to represent the lemma, which is stored in the lemma attribute. (Note the lack of trailing underscore.)

print(f'Token: {tokens[1].text}')
print(f'Human-readable lemma: {tokens[1].lemma_}')
print(f'spaCy lemma key: {tokens[1].lemma}')
Token: narrantur
Human-readable lemma: narro
spaCy lemma key: 11361982710182407617
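spaCy’s StringStore maps in both directions, so the human-readable lemma can be recovered from the hash through the pipeline’s vocabulary…

# Recover the human-readable lemma from its hash via the StringStore
print(nlp.vocab.strings[tokens[1].lemma])
narro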

In order to compare the two different lemmatization passes, we can create two copies of the LatinCy pipeline, each with one of the two lemmatizers removed…

import copy

P1 = copy.deepcopy(nlp)
P1.disable_pipes(['tagger', 'morphologizer', 'lookup_lemmatizer'])
print(f'First pipeline components: {P1.pipe_names}')

P2 = copy.deepcopy(nlp)
P2.disable_pipes(['tagger', 'morphologizer', 'trainable_lemmatizer'])
print(f'Second pipeline components: {P2.pipe_names}')
First pipeline components: ['senter', 'normer', 'tok2vec', 'trainable_lemmatizer', 'parser', 'ner']
Second pipeline components: ['senter', 'normer', 'tok2vec', 'parser', 'lookup_lemmatizer', 'ner']
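Note that disable_pipes is a deprecated alias in spaCy 3; select_pipes is the current equivalent. A quick sketch, reusing the deep-copy pattern above with a hypothetical third copy P3…

# spaCy 3 equivalent of the deprecated disable_pipes call
P3 = copy.deepcopy(nlp)
P3.select_pipes(disable=['tagger', 'morphologizer', 'lookup_lemmatizer'])
print(P3.pipe_names)
['senter', 'normer', 'tok2vec', 'trainable_lemmatizer', 'parser', 'ner']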

We can then run the same text through both pipelines and compare the results side-by-side…

# Annotate the same text with each single-lemmatizer pipeline
P1_annotations = P1(text)
P2_annotations = P2(text)

data = []

# Pair up tokens from the two pipelines and collect their lemmas
for p1_token, p2_token in zip(P1_annotations, P2_annotations):
    data.append([p1_token.text, p1_token.lemma_, p2_token.lemma_])

print(tabulate.tabulate(data, headers=['Text', 'Trainable lemmatizer', 'Lookup lemmatizer']))
Text          Trainable lemmatizer    Lookup lemmatizer
------------  ----------------------  -------------------
Haec          hic
narrantur     narro                   narro
a             ab
poetis        poetus                  poeta
de            de
Perseo        Perseo
.             .                       .
Perseus       Perseus
filius        filius                  filius
erat          sum                     sum
Iovis         Iovis                   Iuppiter
,             ,                       ,
maximi
deorum        deus                    deus
.             .                       .
Avus          avus                    auus
eius          is                      is
Acrisius      Acrisius
appellabatur  appello                 appello
.             .                       .
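Since the comparison rows are already collected in data, we can also filter programmatically for the tokens where both lemmatizers assigned a lemma but disagree…

# Rows where both lemmatizers assigned a lemma, but the values differ
disagreements = [row for row in data if row[1] and row[2] and row[1] != row[2]]
print(tabulate.tabulate(disagreements, headers=['Text', 'Trainable lemmatizer', 'Lookup lemmatizer']))
Text    Trainable lemmatizer  Lookup lemmatizer
------  --------------------  -----------------
poetis  poetus                poeta
Iovis   Iovis                 Iuppiter
Avus    avus                  auus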

Note specifically the lemmatization of Iovis: since it is a highly irregular form, it is not surprising that the Edit Tree Lemmatizer has manufactured a plausible, but incorrect, lemma based on the root Iov-. Since Iovis is an unambiguous Latin form and has been added to the LatinCy lookups, the Lookup Lemmatizer steps in to correct the erroneous first pass. The lookup data can be found in a custom fork of the spaCy spacy-lookups-data package [https://github.com/diyclassics/spacy-lookups-data/tree/master/spacy_lookups_data/data]. These lookups are installed as a dependency of each of the LatinCy pipelines.

The code below shows how to access the lookup data directly…

# Load the lookups data

from spacy.lookups import load_lookups

blank_nlp = spacy.blank("la")

lookups_data = load_lookups(lang=blank_nlp.vocab.lang, tables=["lemma_lookup"])
LOOKUPS = lookups_data.get_table("lemma_lookup")

print(LOOKUPS['Iovis'])
Iuppiter
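The returned table behaves like a standard Python mapping, so ordinary operations such as membership tests work as expected; len(LOOKUPS) likewise reports the table’s size, on the order of the ~1M items mentioned above (the exact count varies by release)…

# The lookup table supports standard mapping operations
print('Iovis' in LOOKUPS)
True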

References

NLTK Chapter 3, Section 3.6 “Normalizing text”
SLP Chapter 2, Section 2.6 “Word normalization, lemmatization and stemming”, pp. 23-24
spaCy EditTreeLemmatizer