8  Morphological Tagging

8.1 Morphological Tagging with LatinCy

Morphological tagging is the task of assigning each token in a text the morphological features appropriate to its part of speech. A noun has tags for gender, number, and case, while a verb (a finite verb, at least) has tags for person, number, tense, mood, and voice. The default LatinCy pipelines include a morphologizer component that assigns these tags.

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Morphological tags are annotations of the Token objects and are stored in the morph attribute.

sample_token = doc[1]

print(f'Sample token: {sample_token.text}')
print(f'Sample morph: {sample_token.morph}')
Sample token: narrantur
Sample morph: Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass

Note that the morphological annotations are stored in a spaCy MorphAnalysis object:

type(sample_token.morph)
spacy.tokens.morphanalysis.MorphAnalysis

LatinCy users may find the following MorphAnalysis methods useful.

To get all of the morphological tags for a token as a Python dict, use the to_dict method:

sample_morph_dict = sample_token.morph.to_dict()
pprint(sample_morph_dict)
{'Mood': 'Ind',
 'Number': 'Plur',
 'Person': '3',
 'Tense': 'Pres',
 'VerbForm': 'Fin',
 'Voice': 'Pass'}

You can also get the morphological analysis as a single string of pipe-separated Key=Value pairs with the to_json method:

sample_morph_str = sample_token.morph.to_json()
pprint(sample_morph_str)
'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass'
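
The dict and string forms carry the same information, so it is easy to move between them outside of spaCy as well, for example when reading morphological strings back from a CSV file. The following is a minimal sketch in plain Python; `feats_to_dict` and `dict_to_feats` are hypothetical helper names, not part of spaCy or LatinCy.

```python
# Hypothetical helpers (not part of spaCy) for 'Key=Value|Key=Value' strings.

def feats_to_dict(feats: str) -> dict:
    """Split a feature string into a dict; an empty string gives {}."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def dict_to_feats(features: dict) -> str:
    """Join a dict back into a feature string, with keys in sorted order."""
    return "|".join(f"{key}={value}" for key, value in sorted(features.items()))


feats = "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass"
as_dict = feats_to_dict(feats)

print(as_dict["Voice"])                 # Pass
print(dict_to_feats(as_dict) == feats)  # True
```

Sorting the keys on the way back mirrors the alphabetical order seen in the output above, so the round trip reproduces the original string.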

Conversely, a MorphAnalysis object can be created from a Python dict using the set_morph method on a spaCy Token object.

text = "Festina lente."
doc_no_annotations = nlp.make_doc(text)
print(f'{doc_no_annotations[0].text} -> {doc_no_annotations[0].morph if doc_no_annotations[0].morph else "{No morph data}"}')
Festina -> {No morph data}
# Use the same abbreviated values that the pipeline itself emits (Sing, Pres, Act, ...)
festina_dict = {"Person": "2", "Number": "Sing", "Tense": "Pres", "Mood": "Imp", "Voice": "Act"}
doc_no_annotations[0].set_morph(festina_dict)
print(f'{doc_no_annotations[0].text} -> {doc_no_annotations[0].morph if doc_no_annotations[0].morph else "{No morph data}"}')
print(type(doc_no_annotations[0].morph))
Festina -> Mood=Imp|Number=Sing|Person=2|Tense=Pres|Voice=Act
<class 'spacy.tokens.morphanalysis.MorphAnalysis'>
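
Because set_morph appears to store the keys and values it is given without checking them against a tag inventory, it can be worth checking a hand-written dict before attaching it to a token. The sketch below is one way to do that in plain Python. The `KNOWN_VALUES` table and the `unknown_features` helper are hypothetical, and the table is deliberately incomplete: it lists only values that appear in this section's model output.

```python
# Illustrative and incomplete: only values seen in this section's model output.
KNOWN_VALUES = {
    "Case": {"Acc"},
    "Gender": {"Fem", "Neut"},
    "Mood": {"Ind"},
    "Number": {"Sing", "Plur"},
    "Person": {"3"},
    "Tense": {"Pres", "Perf"},
    "VerbForm": {"Fin"},
    "Voice": {"Act", "Pass"},
}


def unknown_features(features, known=KNOWN_VALUES):
    """Return the (key, value) pairs that are missing from the reference table."""
    return [
        (key, value)
        for key, value in features.items()
        if value not in known.get(key, set())
    ]


candidate = {"Person": "3", "Number": "Singular", "Tense": "Pres", "Mood": "Ind", "Voice": "Active"}
print(unknown_features(candidate))
# [('Number', 'Singular'), ('Voice', 'Active')]
```

An empty list means every pair was found in the table; anything returned is a candidate for correction (here, to Sing and Act).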

As noted above, the morphological tags are POS-specific. LatinCy uses a limited set of morphological keys based on the UD treebanks. With respect to the tag values, one specific adjustment from the UD treebanks is that, for verbs, LatinCy uses the six “traditional” tense values: present, imperfect, future, perfect, pluperfect, and future perfect.

import tabulate

doc = nlp("Tum arcam ipsam in mare coniecit.")

# One row per token: the surface form and its morphological string (empty if the token has no tags)
data = [[token.text, token.morph.to_json()] for token in doc]

print(tabulate.tabulate(data, headers=['Text', 'Morph']))
Text      Morph
--------  ---------------------------------------------------------------
Tum
arcam     Case=Acc|Gender=Fem|Number=Sing
ipsam     Case=Acc|Gender=Fem|Number=Sing
in
mare      Case=Acc|Gender=Neut|Number=Sing
coniecit  Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act
.
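
Rows like these are easy to filter by feature, for example to collect every feminine accusative in a sentence. The following is a minimal sketch in plain Python that works on the (text, morph string) pairs from the table above rather than on a live Doc; `has_features` is a hypothetical helper name.

```python
# The (text, morph) pairs from the table above, copied in by hand.
rows = [
    ("Tum", ""),
    ("arcam", "Case=Acc|Gender=Fem|Number=Sing"),
    ("ipsam", "Case=Acc|Gender=Fem|Number=Sing"),
    ("in", ""),
    ("mare", "Case=Acc|Gender=Neut|Number=Sing"),
    ("coniecit", "Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act"),
    (".", ""),
]


def has_features(morph, **wanted):
    """True if the morph string contains every requested Key=Value pair."""
    feats = dict(pair.split("=", 1) for pair in morph.split("|") if pair)
    return all(feats.get(key) == value for key, value in wanted.items())


fem_acc = [text for text, morph in rows if has_features(morph, Case="Acc", Gender="Fem")]
print(fem_acc)  # ['arcam', 'ipsam']
```

Tokens with no morphological tags, such as Tum and in, have an empty string and therefore never match a feature query.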

Note how in the sentence above the verb coniecit is tagged with ‘Tense=Perf’ (for perfect), whereas in the UD treebanks it would be annotated with ‘Tense=Past’ in combination with ‘Aspect=Perf’.
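
To make the difference concrete, here is a minimal sketch of how a UD-style Tense plus Aspect pair could be collapsed into a single traditional tense value. This is an illustration only, not LatinCy's actual preprocessing code: `collapse_tense` and `TENSE_TABLE` are hypothetical names, and the table contains only the one correspondence stated above (Past plus Perf becomes Perf). Rows for the other tenses would have to be added from the treebank documentation.

```python
# Only the Past + Perf row comes from the text; any other rows are up to the user.
TENSE_TABLE = {("Past", "Perf"): "Perf"}


def collapse_tense(features, table=TENSE_TABLE):
    """Return a copy of features with Tense and Aspect merged where the table has an entry."""
    key = (features.get("Tense"), features.get("Aspect"))
    if key not in table:
        return dict(features)  # nothing to collapse; leave the features untouched
    merged = {k: v for k, v in features.items() if k != "Aspect"}
    merged["Tense"] = table[key]
    return merged


ud_style = {
    "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "3",
    "Tense": "Past", "VerbForm": "Fin", "Voice": "Act",
}
print(collapse_tense(ud_style))
# {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Perf', 'VerbForm': 'Fin', 'Voice': 'Act'}
```

The result matches the features shown for coniecit in the table above.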
