We can use LatinCy annotations as the basis for matching spans of tokens using spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among others. The full list can be found in the spaCy documentation. Moreover, there are many operators, quantifiers, and other operations that can be used to create increasingly complex patterns. The number of possible pattern combinations is enormous, so this should be considered only the most basic introduction to what is possible.
Here we run through a quick example using the Praefatio to Livy’s Ab urbe condita: first finding forms of res, then extending the patterns to match res publica.
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec, si sciam, dicere ausim, quippe qui cum veterem tum volgatam esse rem videam, dum novi semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. Utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili parte et ipsum consuluisse; et si in tanta scriptorum turba mea fama in obscuro sit, nobilitate ac magnitudine eorum me qui nomini officient meo consoler. Res est praeterea et immensi operis,
The Matcher is initialized with the Vocab object from our loaded pipeline.
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x3377eadd0>
We use the Matcher by adding patterns, quite sensibly, with the add method. With add, we assign a match_id “name” to the pattern and supply the patterns themselves. The patterns are lists of lists of dictionaries: each inner list is one pattern, and its dictionaries are arranged in the order of the tokens we want to match, where the dictionary keys are the attributes and the dictionary values are the specific terms to be matched for that attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the text attribute matches, and matches exactly, the string “res”. The search can then be made case-insensitive, for instance by matching on LOWER rather than TEXT, so that “Res” is found as well; these are the res_uncased results shown here. The Matcher returns the match_id as well as the start and end indices for each matched span.
Match ID Start End Matched text
----------- ------- ----- --------------
res_uncased 8 9 res
res_uncased 93 94 Res
res_uncased 421 422 res
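The single-token patterns described above might be written roughly as follows. This is a minimal sketch, not the original notebook cell: it assumes a blank multi-language pipeline (`spacy.blank("xx")`) as a stand-in for the loaded LatinCy pipeline, and the short word list is pieced together from the excerpt for illustration, so the token indices differ from those in the table.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Stand-in for the loaded LatinCy pipeline (an assumption so the sketch
# runs without downloading a model).
nlp = spacy.blank("xx")

# Toy Doc built from words in the excerpt; not the full Praefatio.
words = ["res", "populi", "Romani", "perscripserim", "Res", "est", "praeterea"]
doc = Doc(nlp.vocab, words=words)

matcher = Matcher(nlp.vocab)
# Each pattern is a list of token dictionaries; add() takes a list of patterns.
matcher.add("res_exact", [[{"TEXT": "res"}]])     # case-sensitive
matcher.add("res_uncased", [[{"LOWER": "res"}]])  # case-insensitive

found = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the vocab's StringStore maps it back to the name.
    name = nlp.vocab.strings[match_id]
    found.append((name, start, end, doc[start:end].text))
    print(name, start, end, doc[start:end].text)
```

The exact pattern finds only the lowercase token, while the LOWER pattern also picks up the capitalized “Res”.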
Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.
Match ID Start End Matched text
---------- ------- ----- --------------
res_lemma 8 9 res
res_lemma 30 31 rem
res_lemma 39 40 rebus
res_lemma 57 58 rerum
res_lemma 93 94 Res
res_lemma 212 213 rerum
res_lemma 377 378 rerum
res_lemma 396 397 rei
res_lemma 421 422 res
res_lemma 459 460 rerum
res_lemma 502 503 rei
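A lemma-based pattern of the kind just described can be sketched as below. Since the LatinCy model is not loaded here, the lemmas are supplied by hand when constructing the Doc; both the blank `xx` pipeline and the hand-assigned lemmas are assumptions made for illustration, whereas in practice the annotations would come from the LatinCy lemmatizer.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("xx")  # stand-in for the LatinCy pipeline (assumption)

# Words from the excerpt with hand-assigned lemmas (illustrative only).
words = ["veterem", "esse", "rem", "videam", "in", "rebus", "rerum", "gestarum"]
lemmas = ["vetus", "sum", "res", "video", "in", "res", "res", "gero"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# Match any token whose lemma is "res", whatever its surface form.
matcher.add("res_lemma", [[{"LEMMA": "res"}]])

hits = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
for hit in hits:
    print(*hit)
```

One pattern now covers “rem”, “rebus”, and “rerum” without listing each inflected form.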
So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.
Match ID Start End Matched text
------------------ ------- ----- --------------
res_lemma_noun_not 8 10 res populi
res_lemma_noun_not 30 32 rem videam
res_lemma_noun_not 39 41 rebus certius
res_lemma_noun_not 57 59 rerum gestarum
res_lemma_noun_not 93 95 Res est
res_lemma_noun_not 212 214 rerum gestarum
res_lemma_noun_not 377 379 rerum salubre
res_lemma_noun_not 396 398 rei publicae
res_lemma_noun_not 421 423 res publica
res_lemma_noun_not 459 461 rerum minus
res_lemma_noun_not 502 504 rei absint
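The two-token pattern as described in the prose (lemma “res” followed by a NOUN) could look like the sketch below. It follows the prose description only and is not a reconstruction of the cell that produced the table above. The blank `xx` pipeline and the hand-assigned lemmas and POS tags are assumptions standing in for LatinCy annotations.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("xx")  # stand-in for the LatinCy pipeline (assumption)

# Hand-annotated toy Doc (illustrative annotations, not LatinCy output).
words = ["res", "populi", "Romani", "rem", "videam"]
lemmas = ["res", "populus", "Romanus", "res", "video"]
pos = ["NOUN", "NOUN", "ADJ", "NOUN", "VERB"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

matcher = Matcher(nlp.vocab)
# Two dictionaries in one pattern = two consecutive tokens.
matcher.add("res_lemma_noun", [[{"LEMMA": "res"}, {"POS": "NOUN"}]])

hits = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
for hit in hits:
    print(*hit)
```

In this toy Doc only “res populi” satisfies both constraints; “rem videam” is skipped because the second token is tagged as a verb.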
The Matcher allows for “fuzzy” matching based on Levenshtein distance (details in documentation, linked below)…
Match ID Start End Matched text
---------------- ------- ----- --------------
res_public_fuzzy 396 398 rei publicae
res_public_fuzzy 421 423 res publica
Of course, you could search for res publica more directly with a two-lemma pattern…
Match ID Start End Matched text
------------------ ------- ----- --------------
res_publica_lemmas 396 398 rei publicae
res_publica_lemmas 421 423 res publica
Patterns are not limited to text search; they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…
Match ID Start End Matched text
---------- ------- ----- -------------------
noun_adjs 9 11 populi Romani
noun_adjs 39 41 rebus certius
noun_adjs 46 48 arte rudem
noun_adjs 127 129 origines proxima
noun_adjs 205 207 urbem poeticis
noun_adjs 235 237 urbium augustiora
noun_adjs 258 260 populo Romano
noun_adjs 274 276 gentes humanae
noun_adjs 377 379 rerum salubre
noun_adjs 396 398 rei publicae
noun_adjs 404 406 inceptu foedum
noun_adjs 421 423 res publica
noun_adjs 429 431 exemplis ditior
noun_adjs 534 536 successus prosperos
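An annotation-only pattern like the NOUN-ADJ search might be sketched as follows. The POS tags are assigned by hand on a toy Doc and the blank `xx` pipeline stands in for LatinCy; both are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("xx")  # stand-in for the LatinCy pipeline (assumption)

# Toy Doc with hand-assigned POS tags (illustrative only).
words = ["res", "populi", "Romani", "scribendi", "arte", "rudem"]
pos = ["NOUN", "NOUN", "ADJ", "VERB", "NOUN", "ADJ"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# No text constraints at all: a NOUN immediately followed by an ADJ.
matcher.add("noun_adjs", [[{"POS": "NOUN"}, {"POS": "ADJ"}]])

hits = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
for hit in hits:
    print(*hit)
```

Because the pattern refers only to POS, it generalizes to any noun and adjective pair the tagger identifies.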
As well as a list of all alliterative patterns, though with some creative regexing…
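The original regex is not shown, so the following is only one hypothetical way to approach alliteration with the Matcher's REGEX operator: build one two-token pattern per letter, each requiring both tokens to begin with that letter. The blank `xx` pipeline and the toy word list are assumptions, and the sketch only considers pairs of adjacent tokens starting with an ASCII letter.

```python
import string

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("xx")  # stand-in for the LatinCy pipeline (assumption)

# Words from the excerpt; only tokenization-level attributes are needed.
words = ["principis", "terrarum", "populi", "pro", "virili", "parte"]
doc = Doc(nlp.vocab, words=words)

# One pattern per letter: two adjacent tokens whose lowercase form
# starts with the same letter.
patterns = [
    [{"LOWER": {"REGEX": f"^{letter}"}}, {"LOWER": {"REGEX": f"^{letter}"}}]
    for letter in string.ascii_lowercase
]

matcher = Matcher(nlp.vocab)
matcher.add("alliteration", patterns)

hits = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
for hit in hits:
    print(*hit)
```

Longer alliterative runs could be handled by adding a quantifier such as `"OP": "+"` to the second token, at the cost of overlapping matches that would then need filtering.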