Morphological tagging is the task of mapping a token in a text to various morphological tags as appropriate for the token’s part of speech. A noun will have morphological tags for its gender, number, and case, while a verb—a finite verb, at least—will have tags for its person, number, tense, mood, and voice. The default LatinCy pipelines have a morphologizer component that assigns these tags.
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_lg')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Morphological tags are annotations of the Token objects and are stored in the morph attribute.
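The following is a minimal sketch of how the morph attribute can be inspected. To keep it runnable without downloading a trained pipeline, it builds a Doc by hand and assigns the feature strings manually with Token.set_morph; that hand assignment is an assumption made for illustration and stands in for what the morphologizer component would normally predict.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a two-token Doc by hand; no trained pipeline is loaded here.
doc = Doc(Vocab(), words=["arcam", "coniecit"])

# Assumption: feature strings are set by hand for illustration,
# standing in for the predictions of a morphologizer component.
doc[0].set_morph("Case=Acc|Gender=Fem|Number=Sing")
doc[1].set_morph("Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act")

noun, verb = doc[0], doc[1]

noun_case = noun.morph.get("Case")   # list of values for a single key
verb_feats = verb.morph.to_dict()    # all features as a plain dict
noun_feats_str = str(noun.morph)     # UD FEATS-style string

print(noun_case)
print(verb_feats["Tense"])
print(noun_feats_str)
```

With a loaded LatinCy pipeline, the same three accessors can be called on any token of a processed Doc.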
As noted above, the morphological tags are POS-specific. LatinCy uses a limited set of morphological keys based on the UD treebanks. It departs from the UD treebanks in one respect concerning tag values: for verbs, LatinCy uses the six “traditional” tense values, i.e. present, imperfect, future, perfect, pluperfect, and future perfect.
import tabulate

# Collect each token's text and its morphological tags as a FEATS-style string
data = []
doc = nlp("Tum arcam ipsam in mare coniecit.")
for token in doc:
    data.append([token.text, token.morph.to_json()])

print(tabulate.tabulate(data, headers=['Text', "Morph"]))
Text      Morph
--------  ---------------------------------------------------------------
Tum
arcam     Case=Acc|Gender=Fem|Number=Sing
ipsam     Case=Acc|Gender=Fem|Number=Sing
in
mare      Case=Acc|Gender=Neut|Number=Sing
coniecit  Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act
.
Note how in the sentence above the verb coniecit is tagged with the ‘Tense=Perf’ (for perfect), whereas in the UD treebanks this would have been annotated with ‘Tense=Past’ in coordination with ‘Aspect=Perf’.
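The difference between the two annotation styles can be sketched in plain Python. The helper names below (parse_feats, format_feats, ud_to_traditional_perfect) are hypothetical and are not part of LatinCy or spaCy; the sketch covers only the one correspondence stated above (‘Tense=Past’ with ‘Aspect=Perf’ becoming ‘Tense=Perf’) and makes no claim about how LatinCy's training data was actually converted.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a UD FEATS-style string into a dict; '' or '_' gives {}."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: Dict[str, str]) -> str:
    """Serialize a feature dict back to a string, sorted by key name."""
    return "|".join(f"{key}={value}" for key, value in sorted(feats.items()))


def ud_to_traditional_perfect(feats: str) -> str:
    """Hypothetical helper: collapse Tense=Past + Aspect=Perf into Tense=Perf.

    Any other combination of features is returned unchanged.
    """
    parsed = parse_feats(feats)
    if parsed.get("Tense") == "Past" and parsed.get("Aspect") == "Perf":
        parsed["Tense"] = "Perf"
        del parsed["Aspect"]
    return format_feats(parsed)


# UD-style annotation for coniecit, as described in the note above
ud_style = "Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act"
latincy_style = ud_to_traditional_perfect(ud_style)
print(latincy_style)
# → Mood=Ind|Number=Sing|Person=3|Tense=Perf|VerbForm=Fin|Voice=Act
```

The other five traditional tenses are deliberately left out, since their UD counterparts are not spelled out here.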
References
NLTK, Chapter 5, “Categorizing and tagging words”
spaCy Morphologizer