# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_md')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Word tokenization is the task of splitting a text into words (and wordlike units such as punctuation, numbers, etc.). For the LatinCy models, tokenization is the fundamental pipeline component on which all other components depend. SpaCy tokenization is non-destructive and "canonical": non-destructive in that the original text can be reconstructed, so to speak, from the Token annotations, and canonical in that each token is assigned an index during this process, and these indices are used to refer to the tokens in all other annotations. (Tokens can be separated or merged, but this requires the user to actively undo and redefine the tokenization output, as in the sketch below.) LatinCy uses a modified version of the default spaCy tokenizer that recognizes and splits the enclitic -que using a rule-based process. (NB: It is in the LatinCy development plan to move enclitic splitting to a separate post-tokenization component.)
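To make the last point concrete, here is a minimal sketch of merging tokens back together with spaCy's retokenize context manager; it assumes, as discussed later in this section, that the tokenizer splits virumque into virum and que.

sample = nlp('Arma virumque cano')
print([token.text for token in sample])  # expected: ['Arma', 'virum', 'que', 'cano']

# Actively redefine the tokenization by merging 'virum' + 'que' back into one token
with sample.retokenize() as retokenizer:
    retokenizer.merge(sample[1:3])

print([token.text for token in sample])  # expected: ['Arma', 'virumque', 'cano']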
The spaCy Doc object is an iterable and tokens are the iteration unit.
tokens = [item for item in doc]
print(tokens)
[Haec, narrantur, a, poetis, de, Perseo, ., Perseus, filius, erat, Iovis, ,, maximi, deorum, ., Avus, eius, Acrisius, appellabatur, .]
token = tokens[0]
print(type(token))
<class 'spacy.tokens.token.Token'>
The text content of a Token object can be retrieved with the text attribute.
for i, token in enumerate(tokens, 1):
    print(f'{i}: {token.text}')
1: Haec
2: narrantur
3: a
4: poetis
5: de
6: Perseo
7: .
8: Perseus
9: filius
10: erat
11: Iovis
12: ,
13: maximi
14: deorum
15: .
16: Avus
17: eius
18: Acrisius
19: appellabatur
20: .
Note again that the token itself is a spaCy Token object and that the text attribute returns a Python string even though their representations in the Jupyter Notebook look the same.
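Compare the two directly:

print(f'{type(token)} -> {token.text}')
print(f'{type(token.text)} -> {token.text}')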
<class 'spacy.tokens.token.Token'> -> Haec
<class 'str'> -> Haec
5.1.1 Token attributes and methods related to tokenization
Here are some attributes/methods available for spaCy Token objects that are relevant to word tokenization.
SpaCy keeps track of both token indices and character offsets within a Doc using the i and idx attributes, respectively…
(Each Token retains a reference to its containing Doc through the doc attribute.)

print(token.doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
# token indices
for token in doc:
    print(f'{token.i}: {token.text}')
0: Haec
1: narrantur
2: a
3: poetis
4: de
5: Perseo
6: .
7: Perseus
8: filius
9: erat
10: Iovis
11: ,
12: maximi
13: deorum
14: .
15: Avus
16: eius
17: Acrisius
18: appellabatur
19: .
This is functionally equivalent to using enumerate…
# token indices, with enumerate
for i, token in enumerate(doc):
    print(f'{i}: {token.text}')
0: Haec
1: narrantur
2: a
3: poetis
4: de
5: Perseo
6: .
7: Perseus
8: filius
9: erat
10: Iovis
11: ,
12: maximi
13: deorum
14: .
15: Avus
16: eius
17: Acrisius
18: appellabatur
19: .
Another indexing option is the idx attribute, which gives the character offset of the token within the original Doc object.
# character offsets
for token in doc:
    print(f'{token.idx}: {token.text}')
0: Haec
5: narrantur
15: a
17: poetis
24: de
27: Perseo
33: .
35: Perseus
43: filius
50: erat
55: Iovis
60: ,
62: maximi
69: deorum
75: .
77: Avus
82: eius
87: Acrisius
96: appellabatur
108: .
Observe that these idx attributes are character offsets into the original Doc. We can see from the output above that narrantur begins at idx 5 and that the next word a begins at idx 15. Yet narrantur is only 9 characters long, and the difference between these two numbers is 10! This is because we need to account for whitespace in the original Doc, which is handled by the attribute text_with_ws. To illustrate the point, the output below replaces the trailing space with an underscore.
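For example, looking at the second token, narrantur:

narrantur = doc[1]
print(f'text -> {narrantur.text} (length {len(narrantur.text)})')
print(f'text_with_ws -> {narrantur.text_with_ws.replace(" ", "_")} (length {len(narrantur.text_with_ws)})')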
text -> narrantur (length 9)
text_with_ws -> narrantur_ (length 10)
Accordingly, using the text_with_ws attribute (as opposed to simply the text attribute) we can reconstruct the original text. This is what was meant above by “non-destructive” tokenization. Look at the difference between a text joined using the text attribute and one joined using the text_with_ws attribute.
joined_tokens = ' '.join([token.text for token in doc])
print(joined_tokens)
print(joined_tokens == doc.text)
print()

reconstructed_text = ''.join([token.text_with_ws for token in doc])
print(reconstructed_text)
print(reconstructed_text == doc.text)
Haec narrantur a poetis de Perseo . Perseus filius erat Iovis , maximi deorum . Avus eius Acrisius appellabatur .
False
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
True
Because spaCy tokenization is set from the outset, you can traverse the tokens in a Doc object from the tokens themselves using the nbor method. This method takes an integer argument that specifies the number of tokens to traverse: a positive integer traverses the tokens to the right, a negative integer traverses the tokens to the left.
print(doc[:6])
print('-----')
print(f'{doc[3]}, i.e. i = 3')
print(f'{doc[3].nbor(-1)}, i.e. i - 1 = 2')
print(f'{doc[3].nbor(-2)}, i.e. i - 2 = 1')
print(f'{doc[3].nbor(1)}, i.e. i + 1 = 4')
print(f'{doc[3].nbor(2)}, i.e. i + 2 = 5')
Haec narrantur a poetis de Perseo
-----
poetis, i.e. i = 3
a, i.e. i - 1 = 2
narrantur, i.e. i - 2 = 1
de, i.e. i + 1 = 4
Perseo, i.e. i + 2 = 5
5.1.2 Customization of the spaCy tokenizer in LatinCy
LatinCy aims to split the que enclitic in Latin texts. As noted above, this is currently done through a rule-based approach. Here is the custom tokenizer code (beginning at this line in the code), followed by a description of the process. Note that this process is based on the recommendations in the spaCy documentation: https://spacy.io/usage/training#custom-tokenizer.
Basically, we treat que (and its case and u/v-normalized variants) as punctuation; these are added to the Defaults.suffixes. If no other intervention were made, any word ending in que or a variant would be split into a before-que part and que. Since there is a large number of relatively predictable words that end in que, these are maintained in a list called que_exceptions. All of the words in the que_exceptions list are added as "special cases" using the tokenizer's add_special_case method and so will not be split.
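For illustration, here is a minimal sketch of this mechanism, not the actual LatinCy source: it assumes spaCy's blank Latin pipeline ('la') is available and uses only a handful of the exceptions (the real implementation also handles case and u/v variants).

import spacy
from spacy.util import compile_suffix_regex

nlp_blank = spacy.blank('la')

# Add 'que' to the suffix patterns so it is split off like punctuation
suffixes = list(nlp_blank.Defaults.suffixes) + ['que']
nlp_blank.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# Register a small sample of que_exceptions as special cases so they are not split
for word in ['atque', 'neque', 'itaque', 'quoque']:
    nlp_blank.tokenizer.add_special_case(word, [{'ORTH': word}])

print([t.text for t in nlp_blank.tokenizer('arma virumque atque alia')])
# expected: ['arma', 'virum', 'que', 'atque', 'alia']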
You can see these words in the rules attribute of the tokenizer.
# Sample of 10 que rules from the custom tokenizer
tokenizer_rules = nlp.tokenizer.rules
print(sorted(set([rule.lower() for rule in tokenizer_rules if 'que' in rule]))[:10])
With the exception of the enclitic splitting described above, the LatinCy tokenizer is the same as the default spaCy tokenizer, which is described in detail in the spaCy documentation. Here are some useful attributes/methods for working with LatinCy.
You can tokenize a string, without adding any other pipeline annotations, by calling the tokenizer directly.
tokens = nlp.tokenizer(text)
print(tokens)
print(tokens[0].text)
print(tokens[0].lemma_)  # Note: no annotation here; since the tokenizer has been called directly, the lemmatizer (the entire pipeline, in fact) has not been run
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
Haec
A list of texts can be tokenized in one pass with the pipe method. This yields a generator object where each item is a Doc object containing a tokenized-only text.
texts = ["Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat. Comprehendit igitur Perseum adhuc infantem, et cum matre in arca lignea inclusit. Tum arcam ipsam in mare coniecit. Danae, Persei mater, magnopere territa est; tempestas enim magna mare turbabat. Perseus autem in sinu matris dormiebat.", "Iuppiter tamen haec omnia vidit, et filium suum servare constituit. Tranquillum igitur fecit mare, et arcam ad insulam Seriphum perduxit. Huius insulae Polydectes tum rex erat. Postquam arca ad litus appulsa est, Danae in harena quietem capiebat. Post breve tempus a piscatore quodam reperta est, et ad domum regis Polydectis adducta est. Ille matrem et puerum benigne excepit, et iis sedem tutam in finibus suis dedit. Danae hoc donum libenter accepit, et pro tanto beneficio regi gratias egit.", "Perseus igitur multos annos ibi habitabat, et cum matre sua vitam beatam agebat. At Polydectes Danaen magnopere amabat, atque eam in matrimonium ducere volebat. Hoc tamen consilium Perseo minime gratum erat. Polydectes igitur Perseum dimittere constituit. Tum iuvenem ad se vocavit et haec dixit: \"Turpe est hanc ignavam vitam agere; iam dudum tu adulescens es. Quo usque hic manebis? Tempus est arma capere et virtutem praestare. Hinc abi, et caput Medusae mihi refer.\""]tokens =list(nlp.tokenizer.pipe(texts))print(len(tokens)) # number of documentsprint(len(tokens[0])) # number of tokens in first document
3
76
You can get an explanation of the tokenization “decisions” using the explain method. In the example below, we see how the que in virumque is treated as a suffix (as discussed above) and so is split during tokenization.
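A minimal example, using the opening of the Aeneid:

print(nlp.tokenizer.explain('Arma virumque cano'))

The explain method returns a list of (rule, substring) pairs, so que should appear here paired with SUFFIX, e.g. ('SUFFIX', 'que').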