5  Word Tokenization

5.1 Word tokenization with LatinCy

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_md')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Word tokenization is the task of splitting a text into words (and word-like units such as punctuation, numbers, etc.). For the LatinCy models, tokenization is the fundamental pipeline component on which all other components depend. spaCy uses non-destructive, “canonical” tokenization: non-destructive in that the original text can be reconstructed, so to speak, from the Token annotations, and canonical in that each token is assigned an index during this process, and these indices are used to refer to the tokens in other annotations. (Tokens can be split or merged after the fact, but this requires the user to actively undo and redefine the tokenizer’s output.) LatinCy uses a modified version of the default spaCy tokenizer that recognizes and splits the enclitic -que using a rule-based process. (NB: It is in the LatinCy development plan to move enclitic splitting to a separate post-tokenization component.)
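
To illustrate the non-destructive claim, here is a minimal sketch using the doc and text objects created above. It relies on the standard spaCy Token attribute text_with_ws, which returns the token text together with any trailing whitespace, so that joining the pieces reproduces the input string exactly.

# Rebuild the original string from the token annotations

reconstructed = ''.join(token.text_with_ws for token in doc)
print(reconstructed == text)
True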

The spaCy Doc object is an iterable, and its tokens are the unit of iteration.

tokens = [item for item in doc]
print(tokens)
[Haec, narrantur, a, poetis, de, Perseo, ., Perseus, filius, erat, Iovis, ,, maximi, deorum, ., Avus, eius, Acrisius, appellabatur, .]
token = tokens[0]
print(type(token))
<class 'spacy.tokens.token.Token'>
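
Each Token also records its position in the Doc. The short sketch below uses two standard Token attributes: i, the token’s index in the Doc (the “canonical” index mentioned above), and idx, the character offset of the token in the original text.

# Token indices and character offsets for the first five tokens

for token in doc[:5]:
    print(token.i, token.idx, token.text)
0 0 Haec
1 5 narrantur
2 15 a
3 17 poetis
4 24 de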

The text content of a Token object can be retrieved with the text attribute.

for i, token in enumerate(tokens, 1):
    print(f'{i}: {token.text}')
1: Haec
2: narrantur
3: a
4: poetis
5: de
6: Perseo
7: .
8: Perseus
9: filius
10: erat
11: Iovis
12: ,
13: maximi
14: deorum
15: .
16: Avus
17: eius
18: Acrisius
19: appellabatur
20: .

Note again that the token itself is a spaCy Token object and that the text attribute returns a Python string even though their representations in the Jupyter Notebook look the same.

token = tokens[0]
print(f'{type(token)} -> {token}')
print(f'{type(token.text)} -> {token.text}')
<class 'spacy.tokens.token.Token'> -> Haec
<class 'str'> -> Haec
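
Since punctuation marks are emitted as tokens in their own right, it is often useful to filter them out. Here is a minimal sketch using the standard Token attribute is_punct:

# Keep only the word tokens, dropping punctuation

words = [token.text for token in doc if not token.is_punct]
print(words)
['Haec', 'narrantur', 'a', 'poetis', 'de', 'Perseo', 'Perseus', 'filius', 'erat', 'Iovis', 'maximi', 'deorum', 'Avus', 'eius', 'Acrisius', 'appellabatur']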

5.1.2 Customization of the spaCy tokenizer in LatinCy

LatinCy aims to tokenize the -que enclitic in Latin texts. As noted above, this is currently done through a rule-based approach. Here is the custom tokenizer code (beginning at this line in the code), followed by a description of the process. Note that this process follows the recommendations in the spaCy documentation: https://spacy.io/usage/training#custom-tokenizer.

from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Add 'que' (and its u/v spelling variant 'qve') to the suffix patterns,
        # so they are split off the end of words much like trailing punctuation
        suffixes = nlp.Defaults.suffixes + [
            "que",
            "qve",
        ]
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search

        # Register each word in que_exceptions (in its lower, title, and upper
        # case variants) as a special case so it is not split by the suffix rule
        for item in que_exceptions:
            nlp.tokenizer.add_special_case(item, [{"ORTH": item}])
            nlp.tokenizer.add_special_case(item.lower(), [{"ORTH": item.lower()}])
            nlp.tokenizer.add_special_case(item.title(), [{"ORTH": item.title()}])
            nlp.tokenizer.add_special_case(item.upper(), [{"ORTH": item.upper()}])

    return customize_tokenizer

In essence, we treat que (and its case and u/v spelling variants) as a suffix, much as trailing punctuation is treated: these strings are added to Defaults.suffixes. If no other intervention were made, any word ending in que or a variant would be split into a before-que part and que. Since there is a large number of relatively predictable words that end in que but do not carry it as a separable enclitic, these are maintained in a list called que_exceptions. All of the words in the que_exceptions list are added as “special cases” using the tokenizer’s add_special_case method and so will not be split. The que_exceptions list is as follows, with a short demonstration of the combined effect after it:

que_exceptions = ['quisque', 'quidque', 'quicque', 'quodque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quoque', 'quaque', 'quique', 'quaeque', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'uterque', 'utraque', 'utrumque', 'utriusque', 'utrique', 'utrumque', 'utramque', 'utroque', 'utraque', 'utrique', 'utraeque', 'utrorumque', 'utrarumque', 'utrisque', 'utrosque', 'utrasque', 'quicumque', 'quidcumque', 'quodcumque', 'cuiuscumque', 'cuicumque', 'quemcumque', 'quamcumque', 'quocumque', 'quacumque', 'quicumque', 'quaecumque', 'quorumcumque', 'quarumcumque', 'quibuscumque', 'quoscumque', 'quascumque', 'unusquisque', 'unaquaeque', 'unumquodque', 'unumquidque', 'uniuscuiusque', 'unicuique', 'unumquemque', 'unamquamque', 'unoquoque', 'unaquaque', 'plerusque', 'pleraque', 'plerumque', 'plerique', 'pleraeque', 'pleroque', 'pleramque', 'plerorumque', 'plerarumque', 'plerisque', 'plerosque', 'plerasque', 'absque', 'abusque', 'adaeque', 'adusque', 'aeque', 'antique', 'atque', 'circumundique', 'conseque', 'cumque', 'cunque', 'denique', 'deque', 'donique', 'hucusque', 'inique', 'inseque', 'itaque', 'longinque', 'namque', 'neque', 'oblique', 'peraeque', 'praecoque', 'propinque', 'qualiscumque', 'quandocumque', 'quandoque', 'quantuluscumque', 'quantumcumque', 'quantuscumque', 'quinque', 'quocumque', 'quomodocumque', 'quomque', 'quotacumque', 'quotcumque', 'quotienscumque', 'quotiensque', 'quotusquisque', 'quousque', 'relinque', 'simulatque', 'torque', 'ubicumque', 'ubique', 'undecumque', 'undique', 'usque', 'usquequaque', 'utcumque', 'utercumque', 'utique', 'utrimque', 'utrique', 'utriusque', 'utrobique', 'utrubique']
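
As a quick check of the combined effect, the following sketch (using the loaded pipeline’s tokenizer on a few illustrative words) shows that words in the exception list come through whole, while other words ending in que are split:

# Exception words stay whole; other -que words are split by the suffix rule

for word in ['itaque', 'plerumque', 'virumque', 'arcamque']:
    print(word, '->', [token.text for token in nlp.tokenizer(word)])
itaque -> ['itaque']
plerumque -> ['plerumque']
virumque -> ['virum', 'que']
arcamque -> ['arcam', 'que']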

You can see these words in the rules attribute of the tokenizer.

# Sample of 10 que rules from the custom tokenizer

tokenizer_rules = nlp.tokenizer.rules
print(sorted(list(set([rule.lower() for rule in tokenizer_rules if 'que' in rule])))[:10])
['absque', 'abusque', 'adaeque', 'adusque', 'aeque', 'antique', 'atque', 'circumundique', 'conseque', 'cuicumque']

Apart from the enclitic splitting just described, the LatinCy tokenizer is the same as the default spaCy tokenizer, which is described in detail in the spaCy documentation. Here are some useful attributes and methods for working with the LatinCy tokenizer.

A string can be tokenized without any other pipeline annotation by calling the tokenizer directly.

tokens = nlp.tokenizer(text)
print(tokens)
print(tokens[0].text)
print(tokens[0].lemma_) # Note that there is no annotation here: because the tokenizer has been called directly, the lemmatizer (in fact, the entire rest of the pipeline) has not been run, so lemma_ is empty
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Haec

A list of texts can be tokenized in one pass with the pipe method. This returns a generator in which each item is a Doc object containing the tokenized (but otherwise unannotated) text.

texts = ["Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat. Comprehendit igitur Perseum adhuc infantem, et cum matre in arca lignea inclusit. Tum arcam ipsam in mare coniecit. Danae, Persei mater, magnopere territa est; tempestas enim magna mare turbabat. Perseus autem in sinu matris dormiebat.", "Iuppiter tamen haec omnia vidit, et filium suum servare constituit. Tranquillum igitur fecit mare, et arcam ad insulam Seriphum perduxit. Huius insulae Polydectes tum rex erat. Postquam arca ad litus appulsa est, Danae in harena quietem capiebat. Post breve tempus a piscatore quodam reperta est, et ad domum regis Polydectis adducta est. Ille matrem et puerum benigne excepit, et iis sedem tutam in finibus suis dedit. Danae hoc donum libenter accepit, et pro tanto beneficio regi gratias egit.", "Perseus igitur multos annos ibi habitabat, et cum matre sua vitam beatam agebat. At Polydectes Danaen magnopere amabat, atque eam in matrimonium ducere volebat. Hoc tamen consilium Perseo minime gratum erat. Polydectes igitur Perseum dimittere constituit. Tum iuvenem ad se vocavit et haec dixit: \"Turpe est hanc ignavam vitam agere; iam dudum tu adulescens es. Quo usque hic manebis? Tempus est arma capere et virtutem praestare. Hinc abi, et caput Medusae mihi refer.\""]

tokens = list(nlp.tokenizer.pipe(texts))

print(len(tokens)) # number of documents
print(len(tokens[0])) # number of tokens in first document
3
76
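
Note that pipe returns its results lazily; the list() call above simply consumes the generator all at once. A minimal sketch:

# pipe yields Doc objects lazily; wrap it in list() only if you need random access

docs = nlp.tokenizer.pipe(texts)
print(type(docs))
<class 'generator'>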

You can get an explanation of the tokenization “decisions” using the explain method. In the example below, we see how the que in virumque is treated as a suffix (as discussed above) and so is split during tokenization.

tok_exp = nlp.tokenizer.explain('arma virumque cano')
print(tok_exp)
[('TOKEN', 'arma'), ('TOKEN', 'virum'), ('SUFFIX', 'que'), ('TOKEN', 'cano')]
tokens = nlp.tokenizer('arma uirumque cano')
for i, token in enumerate(tokens):
    print(f'{i}: {token.text}')
0: arma
1: uirum
2: que
3: cano
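
The explain output also makes the special cases visible. In the sketch below, an exception word like itaque should be reported as a special case (spaCy labels these SPECIAL-1, SPECIAL-2, etc.) rather than being split, while virumque is still split into a TOKEN and a SUFFIX; the output should look something like the following.

tok_exp = nlp.tokenizer.explain('itaque arma virumque cano')
print(tok_exp)
[('SPECIAL-1', 'itaque'), ('TOKEN', 'arma'), ('TOKEN', 'virum'), ('SUFFIX', 'que'), ('TOKEN', 'cano')]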

References

SLP Chapter 2, Section 2.4.2 “Word Tokenization”, pp. 18-20 link