8  Morphological Tagging

8.1 Morphological Tagging with LatinCy

Morphological tagging is the task of mapping a token in a text to various morphological tags as appropriate for the token’s part of speech. A noun will have morphological tags for its gender, number, and case, while a verb—a finite verb, at least—will have tags for its person, number, tense, mood, and voice. The default LatinCy pipelines have a morphologizer component that assigns these tags.
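
The POS-specific nature of the tags can be sketched in plain Python. The block below is only an illustration, not part of LatinCy: the `EXPECTED_KEYS` table and the `missing_features` helper are hypothetical names, and the analysis for poetis is written by hand rather than produced by the model.

```python
# Expected feature keys per part of speech, following the description above.
# This table is an illustrative assumption, not LatinCy's internal inventory.
EXPECTED_KEYS = {
    "NOUN": {"Gender", "Number", "Case"},
    "VERB": {"Person", "Number", "Tense", "Mood", "Voice"},  # finite verbs only
}


def missing_features(pos, feats):
    """Return the expected keys for `pos` that are absent from `feats`, sorted."""
    return sorted(EXPECTED_KEYS.get(pos, set()) - set(feats))


# Hand-written analyses (not model output) for two words from the sample text.
poetis = {"Case": "Abl", "Gender": "Masc", "Number": "Plur"}
narrantur = {"Mood": "Ind", "Number": "Plur", "Person": "3",
             "Tense": "Pres", "VerbForm": "Fin", "Voice": "Pass"}

print(missing_features("NOUN", poetis))              # []
print(missing_features("VERB", narrantur))           # []
print(missing_features("NOUN", {"Number": "Plur"}))  # ['Case', 'Gender']
```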

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Morphological tags are annotations of the Token objects and are stored in the morph attribute.

sample_token = doc[1]

print(f'Sample token: {sample_token.text}')
print(f'Sample morph: {sample_token.morph}')
Sample token: narrantur
Sample morph: Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass

Note that the morphological annotations are stored in a spaCy MorphAnalysis object…

type(sample_token.morph)
spacy.tokens.morphanalysis.MorphAnalysis

LatinCy users may find the following MorphAnalysis methods useful.

To get all of the morphological tags for a token as a Python dict, you can use the to_dict method…

sample_morph_dict = sample_token.morph.to_dict()
pprint(sample_morph_dict)
{'Mood': 'Ind',
 'Number': 'Plur',
 'Person': '3',
 'Tense': 'Pres',
 'VerbForm': 'Fin',
 'Voice': 'Pass'}
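
When only a single feature is needed, MorphAnalysis also provides a get method, which returns the values of that feature as a list. The sketch below builds a one-token Doc by hand, with the features copied from the output above, so that it does not depend on a trained pipeline being installed. With a LatinCy pipeline loaded, the same call should work on `sample_token.morph`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a one-token Doc by hand so that no trained pipeline is required.
doc = Doc(Vocab(), words=["narrantur"])

# Features copied from the pipeline output shown above.
doc[0].set_morph("Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass")

tense = doc[0].morph.get("Tense")  # values for a feature the token has
case = doc[0].morph.get("Case")    # a feature the token does not have

print(tense)
print(case)
```

Because get returns a list, a missing feature yields an empty list rather than raising an error, which is convenient when filtering tokens of mixed parts of speech.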

You can also get the string representation of the morphological analysis with the to_json method:

sample_morph_str = sample_token.morph.to_json()
pprint(sample_morph_str)
'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass'
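
The string and dict forms carry the same information: the string joins Key=Value pairs with a vertical bar. As a rough sketch of that relationship, the conversion can be reproduced in plain Python without spaCy. The helper names `parse_feats` and `format_feats` are hypothetical and not part of spaCy or LatinCy.

```python
def parse_feats(feats):
    """Convert 'Key=Value|Key=Value' into a dict; an empty string or '_' gives {}."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feat_dict):
    """Convert a dict back into a feature string, with keys sorted alphabetically."""
    ordered = sorted(feat_dict.items(), key=lambda item: item[0].lower())
    return "|".join(f"{key}={value}" for key, value in ordered)


feats = "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass"
as_dict = parse_feats(feats)

print(as_dict["Voice"])               # Pass
print(format_feats(as_dict) == feats)  # True
```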

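Conversely, a MorphAnalysis object can be created from a Python dict using the set_morph method on a spaCy Token object. Note that set_morph stores the keys and values as given: the spelled-out values in this example (e.g. "Singular") differ from the UD abbreviations (e.g. "Sing") that the pipeline itself emits.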

text = "Festina lente."
doc_no_annotations = nlp.make_doc(text)
print(f'{doc_no_annotations[0].text} -> {doc_no_annotations[0].morph if doc_no_annotations[0].morph else "{No morph data}"}')
Festina -> {No morph data}
festina_dict = {"Person": "2", "Number": "Singular", "Tense": "Present", "Mood": "Imperative", "Voice": "Active"}
doc_no_annotations[0].set_morph(festina_dict)
print(f'{doc_no_annotations[0].text} -> {doc_no_annotations[0].morph if doc_no_annotations[0].morph else "{No morph data}"}')
print(type(doc_no_annotations[0].morph))
Festina -> Mood=Imperative|Number=Singular|Person=2|Tense=Present|Voice=Active
<class 'spacy.tokens.morphanalysis.MorphAnalysis'>
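
To keep hand-set annotations comparable with pipeline output, the values can be normalized to UD abbreviations before calling set_morph. The following is a minimal sketch under stated assumptions: the `UD_VALUES` table and the `to_ud_values` helper are hypothetical, and the table covers only the values used in this example.

```python
# Hypothetical lookup table: spelled-out labels -> UD abbreviations.
# It covers only the values used in the "Festina" example.
UD_VALUES = {
    "Singular": "Sing",
    "Present": "Pres",
    "Imperative": "Imp",
    "Active": "Act",
}


def to_ud_values(feat_dict):
    """Replace known spelled-out values; leave all other values unchanged."""
    return {key: UD_VALUES.get(value, value) for key, value in feat_dict.items()}


festina_dict = {"Person": "2", "Number": "Singular", "Tense": "Present",
                "Mood": "Imperative", "Voice": "Active"}
festina_ud = to_ud_values(festina_dict)

print(festina_ud)
# {'Person': '2', 'Number': 'Sing', 'Tense': 'Pres', 'Mood': 'Imp', 'Voice': 'Act'}
```

The resulting dict can then be passed to set_morph in place of festina_dict.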

As noted above, the morphological tags are POS-specific. LatinCy uses a limited set of morphological keys based on the Universal Dependencies (UD) treebanks. One specific adjustment to the UD tag values concerns verbs: LatinCy can return the six “traditional” tense values, i.e. present, imperfect, future, perfect, pluperfect, and future perfect. The ‘remorpher’ component maps these values to a custom Token attribute, ._.remorph.

doc = nlp("Tum arcam ipsam in mare coniecit.")
tokens = list(doc)

# "coniecit" is the sixth token (index 5)
print('The morph for "coniecit" is...')
print(tokens[5].morph.to_json(), end='\n\n')

print('The remorph for "coniecit" is...')
print(tokens[5]._.remorph.to_json())
The morph for "coniecit" is...
Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act

The remorph for "coniecit" is...
Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act

Note how, in the sentence above, the ‘remorpher’ tags the verb coniecit with ‘Tense=Perf’ (for perfect), whereas the morph attribute derived from the UD treebanks gives ‘Tense=Past’. In the UD treebanks themselves, the perfect is annotated as ‘Tense=Past’ in coordination with ‘Aspect=Perf’.
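
To see exactly where the two analyses diverge, their feature strings can be compared key by key. The sketch below works on the two strings printed above and uses only the standard library. The helper names `feats_to_dict` and `diff_feats` are hypothetical.

```python
def feats_to_dict(feats):
    """Convert 'Key=Value|Key=Value' into a dict."""
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def diff_feats(left, right):
    """Map each key whose value differs to a (left, right) pair; None marks an absent key."""
    a, b = feats_to_dict(left), feats_to_dict(right)
    return {key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)}


# Strings copied from the output for "coniecit" above.
morph = "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act"
remorph = "Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act"

print(diff_feats(morph, remorph))  # {'Tense': ('Past', 'Perf')}
```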

References

NLTK Book, Chapter 5, “Categorizing and Tagging Words”
spaCy Morphologizer