5  Word Tokenization

5.1 Word tokenization with LatinCy

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_md')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.

Word tokenization is the task of splitting a text into words (and word-like units such as punctuation, numbers, etc.). For the LatinCy models, tokenization is the fundamental pipeline component on which all other components depend. spaCy uses non-destructive, “canonical” tokenization: non-destructive in that the original text can be reconstructed, so to speak, from the Token annotations, and canonical in that each token is assigned an index during this process, and these indices are used to refer to the tokens in other annotations. (Tokens can be split or merged after the fact, but this requires the user to actively undo and redefine the tokenizer’s output.) LatinCy uses a modified version of the default spaCy tokenizer that recognizes and splits the enclitic -que using a rule-based process. (NB: It is in the LatinCy development plan to move enclitic splitting to a separate post-tokenization component.)
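
To illustrate the non-destructive claim, here is a minimal sketch using the doc and text objects created above. It relies on the standard spaCy Token attribute text_with_ws, which returns the token text together with any trailing whitespace, so that joining the pieces reproduces the input string exactly.

# Rebuild the original string from the token annotations

reconstructed = ''.join(token.text_with_ws for token in doc)
print(reconstructed == text)
True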

The spaCy Doc object is an iterable, and its tokens are the unit of iteration.

tokens = [item for item in doc]
print(tokens)
[Haec, narrantur, a, poetis, de, Perseo, ., Perseus, filius, erat, Iovis, ,, maximi, deorum, ., Avus, eius, Acrisius, appellabatur, .]
token = tokens[0]
print(type(token))
<class 'spacy.tokens.token.Token'>
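
Each Token also records its position in the Doc. The short sketch below uses two standard Token attributes: i, the token’s index in the Doc (the “canonical” index mentioned above), and idx, the character offset of the token in the original text.

# Token indices and character offsets for the first five tokens

for token in doc[:5]:
    print(token.i, token.idx, token.text)
0 0 Haec
1 5 narrantur
2 15 a
3 17 poetis
4 24 de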

The text content of a Token object can be retrieved with the text attribute.

for i, token in enumerate(tokens, 1):
    print(f'{i}: {token.text}')
1: Haec
2: narrantur
3: a
4: poetis
5: de
6: Perseo
7: .
8: Perseus
9: filius
10: erat
11: Iovis
12: ,
13: maximi
14: deorum
15: .
16: Avus
17: eius
18: Acrisius
19: appellabatur
20: .

Note again that the token itself is a spaCy Token object and that the text attribute returns a Python string even though their representations in the Jupyter Notebook look the same.

token = tokens[0]
print(f'{type(token)} -> {token}')
print(f'{type(token.text)} -> {token.text}')
<class 'spacy.tokens.token.Token'> -> Haec
<class 'str'> -> Haec
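
Since punctuation marks are emitted as tokens in their own right, it is often useful to filter them out. Here is a minimal sketch using the standard Token attribute is_punct:

# Keep only the word tokens, dropping punctuation

words = [token.text for token in doc if not token.is_punct]
print(words)
['Haec', 'narrantur', 'a', 'poetis', 'de', 'Perseo', 'Perseus', 'filius', 'erat', 'Iovis', 'maximi', 'deorum', 'Avus', 'eius', 'Acrisius', 'appellabatur']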

5.1.2 Customization of the spaCy tokenizer in LatinCy

LatinCy aims to tokenize the -que enclitic in Latin texts. As noted above, this is currently done through a rule-based approach. Here is the custom tokenizer code (beginning at this line in the code), followed by a description of the process. Note that this process follows the recommendations in the spaCy documentation: https://spacy.io/usage/training#custom-tokenizer.

from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Add 'que' (and its u/v spelling variant 'qve') to the suffix patterns,
        # so they are split off the end of words much like trailing punctuation
        suffixes = nlp.Defaults.suffixes + [
            "que",
            "qve",
        ]
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search

        # Register each word in que_exceptions (in its lower, title, and upper
        # case variants) as a special case so it is not split by the suffix rule
        for item in que_exceptions:
            nlp.tokenizer.add_special_case(item, [{"ORTH": item}])
            nlp.tokenizer.add_special_case(item.lower(), [{"ORTH": item.lower()}])
            nlp.tokenizer.add_special_case(item.title(), [{"ORTH": item.title()}])
            nlp.tokenizer.add_special_case(item.upper(), [{"ORTH": item.upper()}])

    return customize_tokenizer

In essence, we treat que (and its case and u/v spelling variants) as a suffix, much as trailing punctuation is treated: these strings are added to Defaults.suffixes. If no other intervention were made, any word ending in que or a variant would be split into a before-que part and que. Since there is a large number of relatively predictable words that end in que but do not carry it as a separable enclitic, these are maintained in a list called que_exceptions. All of the words in the que_exceptions list are added as “special cases” using the tokenizer’s add_special_case method and so will not be split. The que_exceptions list is as follows, with a short demonstration of the combined effect after it:

que_exceptions = ['quisque', 'quidque', 'quicque', 'quodque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quoque', 'quaque', 'quique', 'quaeque', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'uterque', 'utraque', 'utrumque', 'utriusque', 'utrique', 'utrumque', 'utramque', 'utroque', 'utraque', 'utrique', 'utraeque', 'utrorumque', 'utrarumque', 'utrisque', 'utrosque', 'utrasque', 'quicumque', 'quidcumque', 'quodcumque', 'cuiuscumque', 'cuicumque', 'quemcumque', 'quamcumque', 'quocumque', 'quacumque', 'quicumque', 'quaecumque', 'quorumcumque', 'quarumcumque', 'quibuscumque', 'quoscumque', 'quascumque', 'unusquisque', 'unaquaeque', 'unumquodque', 'unumquidque', 'uniuscuiusque', 'unicuique', 'unumquemque', 'unamquamque', 'unoquoque', 'unaquaque', 'plerusque', 'pleraque', 'plerumque', 'plerique', 'pleraeque', 'pleroque', 'pleramque', 'plerorumque', 'plerarumque', 'plerisque', 'plerosque', 'plerasque', 'absque', 'abusque', 'adaeque', 'adusque', 'aeque', 'antique', 'atque', 'circumundique', 'conseque', 'cumque', 'cunque', 'denique', 'deque', 'donique', 'hucusque', 'inique', 'inseque', 'itaque', 'longinque', 'namque', 'neque', 'oblique', 'peraeque', 'praecoque', 'propinque', 'qualiscumque', 'quandocumque', 'quandoque', 'quantuluscumque', 'quantumcumque', 'quantuscumque', 'quinque', 'quocumque', 'quomodocumque', 'quomque', 'quotacumque', 'quotcumque', 'quotienscumque', 'quotiensque', 'quotusquisque', 'quousque', 'relinque', 'simulatque', 'torque', 'ubicumque', 'ubique', 'undecumque', 'undique', 'usque', 'usquequaque', 'utcumque', 'utercumque', 'utique', 'utrimque', 'utrique', 'utriusque', 'utrobique', 'utrubique']
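
As a quick check of the combined effect, the following sketch (using the loaded pipeline’s tokenizer on a few illustrative words) shows that words in the exception list come through whole, while other words ending in que are split:

# Exception words stay whole; other -que words are split by the suffix rule

for word in ['itaque', 'plerumque', 'virumque', 'arcamque']:
    print(word, '->', [token.text for token in nlp.tokenizer(word)])
itaque -> ['itaque']
plerumque -> ['plerumque']
virumque -> ['virum', 'que']
arcamque -> ['arcam', 'que']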

You can see these words in the rules attribute of the tokenizer.

# Sample of 10 que rules from the custom tokenizer

tokenizer_rules = nlp.tokenizer.rules
print(sorted(list(set([rule.lower() for rule in tokenizer_rules if 'que' in rule])))[:10])
['absque', 'abusque', 'adaeque', 'adusque', 'aeque', 'antique', 'atque', 'circumundique', 'conseque', 'cuicumque']

Apart from the enclitic splitting just described, the LatinCy tokenizer is the same as the default spaCy tokenizer, which is described in detail in the spaCy documentation. Here are some useful attributes and methods for working with the LatinCy tokenizer.

A string can be tokenized without any other pipeline annotation by calling the tokenizer directly.

tokens = nlp.tokenizer(text)
print(tokens)
print(tokens[0].text)
print(tokens[0].lemma_) # Note that there is no annotation here: because the tokenizer has been called directly, the lemmatizer (in fact, the entire rest of the pipeline) has not been run, so lemma_ is empty
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Haec

A list of texts can be tokenized in one pass with the pipe method. This returns a generator in which each item is a Doc object containing the tokenized (but otherwise unannotated) text.

texts = ["Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat. Comprehendit igitur Perseum adhuc infantem, et cum matre in arca lignea inclusit. Tum arcam ipsam in mare coniecit. Danae, Persei mater, magnopere territa est; tempestas enim magna mare turbabat. Perseus autem in sinu matris dormiebat.", "Iuppiter tamen haec omnia vidit, et filium suum servare constituit. Tranquillum igitur fecit mare, et arcam ad insulam Seriphum perduxit. Huius insulae Polydectes tum rex erat. Postquam arca ad litus appulsa est, Danae in harena quietem capiebat. Post breve tempus a piscatore quodam reperta est, et ad domum regis Polydectis adducta est. Ille matrem et puerum benigne excepit, et iis sedem tutam in finibus suis dedit. Danae hoc donum libenter accepit, et pro tanto beneficio regi gratias egit.", "Perseus igitur multos annos ibi habitabat, et cum matre sua vitam beatam agebat. At Polydectes Danaen magnopere amabat, atque eam in matrimonium ducere volebat. Hoc tamen consilium Perseo minime gratum erat. Polydectes igitur Perseum dimittere constituit. Tum iuvenem ad se vocavit et haec dixit: \"Turpe est hanc ignavam vitam agere; iam dudum tu adulescens es. Quo usque hic manebis? Tempus est arma capere et virtutem praestare. Hinc abi, et caput Medusae mihi refer.\""]

tokens = list(nlp.tokenizer.pipe(texts))

print(len(tokens)) # number of documents
print(len(tokens[0])) # number of tokens in first document
3
76
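
Note that pipe returns its results lazily; the list() call above simply consumes the generator all at once. A minimal sketch:

# pipe yields Doc objects lazily; wrap it in list() only if you need random access

docs = nlp.tokenizer.pipe(texts)
print(type(docs))
<class 'generator'>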

You can get an explanation of the tokenization “decisions” using the explain method. In the example below, we see how the que in virumque is treated as a suffix (as discussed above) and so is split during tokenization.

tok_exp = nlp.tokenizer.explain('arma virumque cano')
print(tok_exp)
[('TOKEN', 'arma'), ('TOKEN', 'virum'), ('SUFFIX', 'que'), ('TOKEN', 'cano')]
tokens = nlp.tokenizer('arma uirumque cano')
for i, token in enumerate(tokens):
    print(f'{i}: {token.text}')
0: arma
1: uirum
2: que
3: cano
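
The explain output also makes the special cases visible. In the sketch below, an exception word like itaque should be reported as a special case (spaCy labels these SPECIAL-1, SPECIAL-2, etc.) rather than being split, while virumque is still split into a TOKEN and a SUFFIX; the output should look something like the following.

tok_exp = nlp.tokenizer.explain('itaque arma virumque cano')
print(tok_exp)
[('SPECIAL-1', 'itaque'), ('TOKEN', 'arma'), ('TOKEN', 'virum'), ('SUFFIX', 'que'), ('TOKEN', 'cano')]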

References

SLP Chapter 2, Section 2.4.2 “Word Tokenization”, pp. 18-20 link