7  POS Tagging

7.1 POS Tagging with LatinCy

Part-of-speech tagging is the task of mapping each token in a text to its part of speech, whether ‘noun’, ‘verb’, ‘preposition’, and so on. Two components in the default LatinCy pipelines provide such annotations, namely the tagger and the morphologizer. Ostensibly, the tagger provides language-specific, fine-grain POS tags and the morphologizer provides coarse-grain tags (as defined by the UD Universal POS tagset); at present, however, the two tagsets in the LatinCy models overlap to a high degree and there are effectively no truly fine-grain tags.

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Note here the two “tagging” components that are included in the pipeline, i.e. “tagger” and “morphologizer”…

print(nlp.pipe_names)
['senter', 'normer', 'tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'lookup_lemmatizer', 'ner']
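
As an aside, if only the POS annotations are needed, the remaining components can be switched off when loading the model. A minimal sketch, using the component names listed above; the earlier components (senter, normer, tok2vec) are left enabled since the tagging components likely depend on the shared tok2vec layer:

# Sketch: load the pipeline with the non-tagging components disabled,
# which speeds things up when only pos_/tag_ annotations are needed
nlp_tags_only = spacy.load('la_core_web_lg',
                           disable=['trainable_lemmatizer', 'parser',
                                    'lookup_lemmatizer', 'ner'])
print(nlp_tags_only.pipe_names)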

Once a text is annotated using the LatinCy pipeline, i.e. as part of the Doc creation process, tags can be found as annotations of the Token objects. The coarse-grain tags are stored in the pos_ attribute; fine-grain tags are stored in the tag_ attribute.

sample_token = doc[1]

print(f'Sample token: {sample_token.text}')
print(f'Sample POS: {sample_token.pos_}')
print(f'Sample TAG: {sample_token.tag_}')
Sample token: narrantur
Sample POS: VERB
Sample TAG: verb

Note the high degree of overlap between the coarse-grain and fine-grain tags in the LatinCy models in the table below. That said, it is perhaps worth paying more attention to where the tagsets do not overlap. In the LatinCy training data, the conventions for which classes of words are labeled, say, “DET” (as with the pos_ of Haec) or “AUX” (as with erat) are inferred from their usage across the six different treebanks used. A sense of the inconsistency between the tagsets can be gleaned from the following page: https://universaldependencies.org/la/; note also the important work of Gamba and Zeman (2023) in this area.

import tabulate

data = []

tokens = [item for item in doc]

for token in tokens:
    data.append([token.text, token.pos_, token.tag_])    

print(tabulate.tabulate(data, headers=['Text', "POS", "TAG"]))  
Text          POS    TAG
------------  -----  -----------
Haec          DET    pronoun
narrantur     VERB   verb
a             ADP    preposition
poetis        NOUN   noun
de            ADP    preposition
Perseo        PROPN  proper_noun
.             PUNCT  punc
Perseus       PROPN  proper_noun
filius        NOUN   noun
erat          AUX    verb
Iovis         PROPN  proper_noun
,             PUNCT  punc
maximi        ADJ    adjective
deorum        NOUN   noun
.             PUNCT  punc
Avus          NOUN   noun
eius          PRON   pronoun
Acrisius      PROPN  proper_noun
appellabatur  VERB   verb
.             PUNCT  punc
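
One way to see where the two tagsets diverge is to collect the distinct coarse-grain/fine-grain pairings that appear in the Doc; a minimal sketch:

# Sketch: collect the distinct (pos_, tag_) pairings in this Doc; pairings
# like ('DET', 'pronoun') or ('AUX', 'verb') mark where the tagsets diverge
pairs = sorted(set((token.pos_, token.tag_) for token in doc))
pprint(pairs)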

As with the lemma annotations, the pos_ and tag_ attributes are only the human-readable versions of these tags. Internally, spaCy uses a hash value to represent each tag, accessible through the corresponding attributes without the trailing underscore (pos and tag)…

print(f"Token: {tokens[1].text}")
print(f'Human-readable TAG: {tokens[1].tag_}')
print(f'spaCy TAG key: {tokens[1].tag}')
Token: narrantur
Human-readable TAG: verb
spaCy TAG key: 6360137228241296794

This tag ‘key’ can be looked up in spaCy’s nlp.vocab.strings attribute…

T = nlp.get_pipe('tagger')
tag_lookup = T.vocab.strings[6360137228241296794] # also would work on `nlp`, i.e. nlp.vocab.strings[6360137228241296794]

print(f'TAG key: {tokens[1].tag}')
print(f'Human-readable TAG: {tag_lookup}')
TAG key: 6360137228241296794
Human-readable TAG: verb
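
The lookup also works in the other direction: passing the human-readable string to nlp.vocab.strings returns the hash key…

# The StringStore maps in both directions: string -> hash and hash -> string
verb_key = nlp.vocab.strings['verb']
print(f'TAG key for "verb": {verb_key}')
print(f'Matches tokens[1].tag: {verb_key == tokens[1].tag}')
TAG key for "verb": 6360137228241296794
Matches tokens[1].tag: True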

The same process applies to the POS ‘keys’…

print(f'Token: {tokens[1].text}')
print(f'Human-readable POS: {tokens[1].pos_}')
print(f'spaCy POS key: {tokens[1].pos}')
Token: narrantur
Human-readable POS: VERB
spaCy POS key: 100

M = nlp.get_pipe("morphologizer")
pos_lookup = M.vocab.strings[100]
print(f'POS key: {tokens[1].pos}')
print(f'Human-readable POS: {pos_lookup}')
POS key: 100
Human-readable POS: VERB

We can use the label_data attribute from the morphologizer component to derive the complete (coarse-grain) tagset…

def split_pos(morph):
    # Extract the value of the POS feature from a morphologizer label string
    if 'POS=' in morph:
        return morph.split('POS=')[1].split('|')[0]
    else:
        return None

tagset = sorted(set(split_pos(k) for k in M.label_data['morph'] if split_pos(k)))
print(tagset)
['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'VERB', 'X']
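
The fine-grain tagset, for its part, can be read directly from the tagger component’s labels attribute; the exact contents depend on the model, so the sketch below simply prints whatever the loaded pipeline provides…

# Sketch: the tagger's labels property lists the fine-grain tagset directly
T = nlp.get_pipe('tagger')
print(sorted(T.labels))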

spaCy has an explain method that shows human-readable descriptions of these standard tags…

data = []

for tag in tagset:
    data.append([tag, spacy.explain(tag)])

print(tabulate.tabulate(data, headers=['TAG', 'Description']))
TAG    Description
-----  -------------------------
ADJ    adjective
ADP    adposition
ADV    adverb
AUX    auxiliary
CCONJ  coordinating conjunction
DET    determiner
INTJ   interjection
NOUN   noun
NUM    numeral
PART   particle
PRON   pronoun
PROPN  proper noun
PUNCT  punctuation
SCONJ  subordinating conjunction
VERB   verb
X      other

It may also be useful to know the “confidence” of the tagger in making its decision. We can derive this from the output of the tagger’s model.predict method, which returns (at least in part) per-token scores for every label; the label with the maximum score determines the final annotation.

# Helper function to get tagging scores

def get_tagging_scores(doc, n=3):
    # cf. https://stackoverflow.com/a/69228515
    tagger = nlp.get_pipe('tagger')
    labels = tagger.labels
    # Run the tagger's model once over the whole Doc; the result is an
    # array of raw scores with one row per token and one column per label
    token_scores = tagger.model.predict([doc])[0]
    scores = []
    for token in doc:
        ranked = [*enumerate(token_scores[token.i])]
        ranked.sort(key=lambda x: x[1], reverse=True)
        scores.append([(labels[i], score) for i, score in ranked[:n]])
    return scores

# Get the top 3 tags by score for each token in the Doc

tagging_scores = get_tagging_scores(doc)

for token in doc:
    print(f'Token: {token.text}', end='\n\n')
    data = []
    for label, score in tagging_scores[token.i]:
        data.append([label, score])
    print(tabulate.tabulate(data, headers=['Label', 'Score']))
    break
Token: Haec

Label         Score
---------  --------
pronoun    10.7023
adjective   8.26614
noun        3.13534
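
Note that these values are the model’s raw output scores, not probabilities. If a probability-like ‘confidence’ is preferred, one option (an illustrative sketch, not something the pipeline computes itself) is to apply a softmax over a token’s label scores:

import numpy as np

# Sketch: convert the raw per-label scores for the first token (Haec) into
# probability-like values with a softmax; purely illustrative
tagger = nlp.get_pipe('tagger')
raw = tagger.model.predict([doc])[0][0]
probs = np.exp(raw - raw.max())
probs = probs / probs.sum()
for i in np.argsort(probs)[::-1][:3]:
    print(f'{tagger.labels[i]}: {probs[i]:.3f}')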

References

NLTK Chapter 5 “Categorizing and tagging words” link
SLP Chapter 8 “Sequence labeling for parts of speech and named entities” link
spaCy Tagger and Morphologizer