We can use LatinCy annotations as the basis for matching spans of tokens using spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among still others. The full list can be found in the spaCy documentation. Moreover, there are many operators, quantifiers, and other operations that can be used to create increasingly complex patterns. The number of possible combinations is frankly enormous, so this should be considered only the most basic introduction to what is possible.
Here we run through a quick example, first finding variations of res and then pattern-matching extensions of res publica, using the Praefatio to Livy’s Ab urbe condita.
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec, si sciam, dicere ausim, quippe qui cum veterem tum volgatam esse rem videam, dum novi semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. Utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili parte et ipsum consuluisse; et si in tanta scriptorum turba mea fama in obscuro sit, nobilitate ac magnitudine eorum me qui nomini officient meo consoler. Res est praeterea et immensi operis,
The Matcher is initialized with the Vocab object from our loaded pipeline.
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x174721e10>
We use the Matcher by adding patterns, quite sensibly, with the add method. With add, we assign a match_id “name” to the pattern and supply the patterns themselves. The patterns are lists of lists of dictionaries: each inner list is one pattern, and its dictionaries are arranged in the order of the tokens we want to match, where the dictionary keys are the attributes and the dictionary values are the specific terms to be matched for that attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the text attribute matches (and matches exactly) the string “res”. The Matcher returns the match_id as well as the start and end indices for each matched span.
Match ID Start End Matched text
----------- ------- ----- --------------
res_uncased 8 9 res
res_uncased 93 94 Res
res_uncased 424 425 res
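As a minimal sketch of how a table like the one above could be produced, the snippet below adds two single-token patterns, one on TEXT (exact match) and one on LOWER (case-insensitive). It assumes a blank Latin pipeline and a short hand-built Doc in place of the LatinCy pipeline and the full Praefatio, so the token indices will not line up with the table. The name res_exact is only illustrative.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank Latin pipeline stands in for the loaded LatinCy pipeline.
nlp = spacy.blank("la")

# Assumption: a short hand-built Doc stands in for the full Praefatio.
words = ["Res", "est", "praeterea", "immensi", "operis", "et", "res", "populi"]
doc = Doc(nlp.vocab, words=words)

matcher = Matcher(nlp.vocab)
# Each pattern is a list of token dictionaries; add() takes a list of patterns.
matcher.add("res_exact", [[{"TEXT": "res"}]])     # exact, case-sensitive
matcher.add("res_uncased", [[{"LOWER": "res"}]])  # matches "res" and "Res"

# Each match is (match_id, start, end); the ID is the hash of the pattern name,
# so we look it up in the vocab's string store to recover the name.
results = [
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
for row in results:
    print(row)
```

The TEXT pattern only finds the lowercase token, while the LOWER pattern also picks up the sentence-initial “Res”.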
Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.
Match ID Start End Matched text
---------- ------- ----- --------------
res_lemma 8 9 res
res_lemma 30 31 rem
res_lemma 39 40 rebus
res_lemma 57 58 rerum
res_lemma 93 94 Res
res_lemma 214 215 rerum
res_lemma 380 381 rerum
res_lemma 399 400 rei
res_lemma 424 425 res
res_lemma 463 464 rerum
res_lemma 507 508 rei
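A lemma-based pattern of this kind might look like the sketch below. Since the Matcher only reads annotations already stored on the Doc, the lemmas here are assigned by hand (an assumption made to keep the example self-contained); in practice the LatinCy lemmatizer would supply them.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank Latin pipeline stands in for the loaded LatinCy pipeline.
nlp = spacy.blank("la")

# Assumption: lemmas are hand-assigned rather than predicted by LatinCy.
words = ["rerum", "gestarum", "memoriae", "rei", "publicae"]
lemmas = ["res", "gero", "memoria", "res", "publicus"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# One token whose LEMMA is "res", whatever its inflected surface form.
matcher.add("res_lemma", [[{"LEMMA": "res"}]])

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```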
So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.
Match ID Start End Matched text
------------------ ------- ----- --------------
res_lemma_noun_not 8 10 res populi
res_lemma_noun_not 30 32 rem videam
res_lemma_noun_not 39 41 rebus certius
res_lemma_noun_not 57 59 rerum gestarum
res_lemma_noun_not 93 95 Res est
res_lemma_noun_not 214 216 rerum gestarum
res_lemma_noun_not 380 382 rerum salubre
res_lemma_noun_not 399 401 rei publicae
res_lemma_noun_not 424 426 res publica
res_lemma_noun_not 463 465 rerum minus
res_lemma_noun_not 507 509 rei absint
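The two-token pattern described in the paragraph above (lemma “res” followed by a NOUN) can be sketched as follows, alongside a negated variant that uses the NOT_IN operator on POS. The Doc and its annotations are hand-built assumptions, and the pattern names res_lemma_noun and res_lemma_not_noun are illustrative rather than taken from the original notebook.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank Latin pipeline stands in for the loaded LatinCy pipeline.
nlp = spacy.blank("la")

# Assumption: lemmas and POS tags are hand-assigned rather than predicted.
words = ["res", "populi", "Romani", "rem", "videam"]
lemmas = ["res", "populus", "Romanus", "res", "video"]
pos = ["NOUN", "NOUN", "ADJ", "NOUN", "VERB"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

matcher = Matcher(nlp.vocab)
# Two dictionaries = two consecutive tokens.
matcher.add("res_lemma_noun", [[{"LEMMA": "res"}, {"POS": "NOUN"}]])
# Negated variant: second token's POS is anything except NOUN.
matcher.add(
    "res_lemma_not_noun",
    [[{"LEMMA": "res"}, {"POS": {"NOT_IN": ["NOUN"]}}]],
)

# Group matched spans by pattern name.
by_name = {}
for match_id, start, end in matcher(doc):
    by_name.setdefault(nlp.vocab.strings[match_id], []).append(doc[start:end].text)
print(by_name)
```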
The Matcher allows for “fuzzy” matching based on Levenshtein distance (details in documentation, linked below)…
Match ID Start End Matched text
---------------- ------- ----- --------------
res_public_fuzzy 399 401 rei publicae
res_public_fuzzy 424 426 res publica
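One plausible way to write such a fuzzy pattern is sketched below; the original pattern is not shown here, so the choice of LOWER with FUZZY on the string “publica” is an assumption. The FUZZY operator requires spaCy 3.5 or later, and by default it tolerates a small Levenshtein distance (at least two edits, scaling with the length of the pattern string). The Doc and lemmas are again hand-built.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank Latin pipeline stands in for the loaded LatinCy pipeline.
nlp = spacy.blank("la")

# Assumption: lemmas are hand-assigned rather than predicted by LatinCy.
words = ["rei", "publicae", "res", "publica", "res", "populi"]
lemmas = ["res", "publicus", "res", "publicus", "res", "populus"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# First token: lemma "res". Second token: lowercase form within a small
# edit distance of "publica", so "publicae" qualifies but "populi" does not.
matcher.add(
    "res_public_fuzzy",
    [[{"LEMMA": "res"}, {"LOWER": {"FUZZY": "publica"}}]],
)

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```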
Of course, you could search for res publica more directly with a two-lemma pattern…
Match ID Start End Matched text
------------------ ------- ----- --------------
res_publica_lemmas 399 401 rei publicae
res_publica_lemmas 424 426 res publica
Patterns are not limited to text search; they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…
Match ID Start End Matched text
---------- ------- ----- -------------------
noun_adjs 9 11 populi Romani
noun_adjs 46 48 arte rudem
noun_adjs 127 129 origines proxima
noun_adjs 207 209 urbem poeticis
noun_adjs 237 239 urbium augustiora
noun_adjs 260 262 populo Romano
noun_adjs 276 278 gentes humanae
noun_adjs 380 382 rerum salubre
noun_adjs 399 401 rei publicae
noun_adjs 407 409 inceptu foedum
noun_adjs 424 426 res publica
noun_adjs 432 434 exemplis ditior
noun_adjs 539 541 successus prosperos
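An annotation-only pattern like this one needs no text constraints at all. The sketch below matches any NOUN immediately followed by an ADJ, using hand-assigned POS tags on a short stand-in Doc (an assumption; LatinCy would normally provide the tags).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank Latin pipeline stands in for the loaded LatinCy pipeline.
nlp = spacy.blank("la")

# Assumption: POS tags are hand-assigned rather than predicted by LatinCy.
words = ["a", "primordio", "urbis", "res", "populi", "Romani", "perscripserim"]
pos = ["ADP", "NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# No TEXT or LEMMA keys: the pattern is defined purely by part of speech.
matcher.add("noun_adjs", [[{"POS": "NOUN"}, {"POS": "ADJ"}]])

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```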
As well as a list of all alliterative patterns, though with some creative regexing…