Part-of-speech tagging is the task of mapping each token in a text to its part of speech, such as ‘noun’, ‘verb’, or ‘preposition’. Two components in the default LatinCy pipelines provide such annotations, namely the tagger and the morphologizer. Ostensibly, the tagger provides language-specific, fine-grained POS tags and the morphologizer provides coarse-grained tags (as defined by the UD Universal POS tags); at present, the LatinCy models have a high degree of overlap between these two tagsets, and there are effectively no fine-grained tags.
```python
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
```
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Note here the two “tagging” components that are included in the pipeline, i.e. “tagger” and “morphologizer”…
Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, tags can be found as annotations on the Token objects. The coarse-grained tags are stored in the pos_ attribute; the fine-grained tags are stored in the tag_ attribute.
Note the high degree of overlap between the coarse-grained and fine-grained tags in the LatinCy models in the chart below. That said, it is perhaps worth paying more attention to where the tagsets do not overlap. In the LatinCy training, conventions for which classes of words are labeled, say, “DET” (as in the pos_ of Haec above) or “AUX” (as in forms of esse, like erat) are inferred from their usage in the six different treebanks used for training. A sense of the inconsistency between the tagsets can be gleaned from the following page: (https://universaldependencies.org/la/); note also the important work of Gamba and Zeman (2023) in this area.
```python
import tabulate

data = []
tokens = [item for item in doc]
for token in tokens:
    data.append([token.text, token.pos_, token.tag_])

print(tabulate.tabulate(data, headers=['Text', "POS", "TAG"]))
```
Text POS TAG
------------ ----- -----------
Haec DET pronoun
narrantur VERB verb
a ADP preposition
poetis NOUN noun
de ADP preposition
Perseo PROPN proper_noun
. PUNCT punc
Perseus PROPN proper_noun
filius NOUN noun
erat AUX verb
Iovis PROPN proper_noun
, PUNCT punc
maximi ADJ adjective
deorum NOUN noun
. PUNCT punc
Avus NOUN noun
eius PRON pronoun
Acrisius PROPN proper_noun
appellabatur VERB verb
. PUNCT punc
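With per-token tags in hand, aggregate statistics follow easily. As a minimal sketch that runs without loading the model, here is a tally of the coarse-grained tags with collections.Counter, using the pos_ values hard-coded from the table above:

```python
from collections import Counter

# pos_ values copied from the table above, so this runs without the model
pos_tags = ['DET', 'VERB', 'ADP', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'PROPN',
            'NOUN', 'AUX', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'PUNCT', 'NOUN',
            'PRON', 'PROPN', 'VERB', 'PUNCT']

counts = Counter(pos_tags)
print(counts['NOUN'], counts['PROPN'], counts['PUNCT'])  # → 4 4 4
```

In a live session, the same list could of course be built directly with [token.pos_ for token in doc].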
As with the lemma annotations, the pos_ and tag_ attributes are only the human-readable versions of the tags. Internally, spaCy uses a hash value to represent each one, available from the same attribute names without the trailing underscore, i.e. pos and tag…
```python
print(f"Token: {tokens[1].text}")
print(f'Human-readable TAG: {tokens[1].tag_}')
print(f'spaCy TAG key: {tokens[1].tag}')
```
Token: narrantur
Human-readable TAG: verb
spaCy TAG key: 6360137228241296794
This tag ‘key’ can be looked up in spaCy’s nlp.vocab.strings attribute…
```python
T = nlp.get_pipe('tagger')
tag_lookup = T.vocab.strings[6360137228241296794]  # also would work on `nlp`, i.e. nlp.vocab.strings[6360137228241296794]
print(f'TAG key: {tokens[1].tag}')
print(f'Human-readable TAG: {tag_lookup}')
```
TAG key: 6360137228241296794
Human-readable TAG: verb
```python
M = nlp.get_pipe("morphologizer")
pos_lookup = M.vocab.strings[100]  # 100 is tokens[1].pos, as printed below
print(f'POS key: {tokens[1].pos}')
print(f'Human-readable POS: {pos_lookup}')
```
POS key: 100
Human-readable POS: VERB
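The key↔string mechanics can also be illustrated without loading a model. The sketch below is a hypothetical, simplified stand-in for spaCy’s StringStore (spaCy actually uses a stable 64-bit hash, whereas Python’s built-in hash stands in here), meant only to show the two-way mapping between integer keys and human-readable tags:

```python
# Hypothetical, simplified stand-in for spaCy's StringStore: strings are
# stored under an integer key, and the key can be looked up later to
# recover the human-readable string.
class MiniStringStore:
    def __init__(self):
        self._strings = {}

    def add(self, string):
        key = hash(string)  # spaCy uses a stable 64-bit hash; Python's hash stands in
        self._strings[key] = string
        return key

    def __getitem__(self, key):
        return self._strings[key]

store = MiniStringStore()
key = store.add('verb')
print(store[key])  # → verb
```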
We can use the label_data attribute of the morphologizer component to derive the complete (coarse-grained) tagset…
```python
def split_pos(morph):
    if 'POS=' in morph:
        return morph.split('POS=')[1].split('|')[0]
    else:
        return None

tagset = sorted(list(set([split_pos(k) for k, v in M.label_data['morph'].items() if split_pos(k)])))
print(tagset)
```
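To see what split_pos does in isolation, here is the same function applied to a few made-up labels in the UD FEATS style (these labels are illustrative only, not drawn from the model’s actual label_data):

```python
def split_pos(morph):
    # Extract the value of the POS= feature from a pipe-delimited label
    if 'POS=' in morph:
        return morph.split('POS=')[1].split('|')[0]
    return None

# Made-up labels in the UD FEATS style, for illustration only
labels = ['Case=Nom|Number=Sing|POS=NOUN', 'POS=VERB|Tense=Pres', 'Degree=Pos']
print([split_pos(label) for label in labels])  # → ['NOUN', 'VERB', None]
```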
spaCy has an explain method that can show human-readable descriptions of these standard tags…
```python
data = []
for tag in tagset:
    data.append([tag, spacy.explain(tag)])

print(tabulate.tabulate(data, headers=['TAG', 'Description']))
```
TAG Description
----- -------------------------
ADJ adjective
ADP adposition
ADV adverb
AUX auxiliary
CCONJ coordinating conjunction
DET determiner
INTJ interjection
NOUN noun
NUM numeral
PART particle
PRON pronoun
PROPN proper noun
PUNCT punctuation
SCONJ subordinating conjunction
VERB verb
X other
It may also be useful to know the “confidence” of the tagger in making its decision. We can derive this from the output of the tagger’s model.predict method, which returns (at least in part) a list of per-tag scores for each token; the maximum value determines the final annotation.
```python
# Helper function to get the top-n tagging scores for each token in a Doc
def get_tagging_scores(doc, n=3):
    # cf. https://stackoverflow.com/a/69228515
    tagger = nlp.get_pipe('tagger')
    labels = tagger.labels
    predictions = tagger.model.predict([doc])[0]  # predict once for the whole Doc
    scores = []
    for token in doc:
        token_scores = predictions[token.i]
        r = sorted(enumerate(token_scores), key=lambda x: x[1], reverse=True)
        scores.append([(labels[i], p) for i, p in r[:n]])
    return scores
```
```python
# Get the top 3 tags by score for each token in the Doc
tagging_probs = get_tagging_scores(doc)

for token in doc:
    print(f'Token: {token.text}', end='\n\n')
    data = []
    for label, prob in tagging_probs[token.i]:
        data.append([label, prob])
    print(tabulate.tabulate(data, headers=['Label', 'Score']))
    break
```
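How these scores should be read depends on the model architecture: spaCy’s default tagger ends in a softmax layer, so the scores are typically already normalized probabilities; if a model instead returns raw, unnormalized scores, a softmax can convert them into a probability distribution. A minimal stdlib sketch, using hypothetical score values:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [5.2, 1.1, -0.3]  # hypothetical per-tag scores for three candidate tags
probs = softmax(raw)
print(probs)  # sums to 1; the highest raw score keeps the highest probability
```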
NLTK, Chapter 5, “Categorizing and tagging words” (link)
SLP, Chapter 8, “Sequence labeling for parts of speech and named entities” (link)
spaCy documentation: Tagger and Morphologizer
Gamba, Federica, and Daniel Zeman. 2023. “Universalising Latin Universal Dependencies: A Harmonisation of Latin Treebanks in UD.” In Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023), edited by Loïc Grobol and Francis Tyers, 7–16. Washington, D.C.: Association for Computational Linguistics. https://aclanthology.org/2023.udw-1.2.