9  Sequence Matching

Author

Patrick J. Burns

Published

August 26, 2024

9.1 Sequence matching with LatinCy

We can use LatinCy annotations as the basis for matching spans of tokens with spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among others; the full list can be found in the spaCy documentation. Moreover, there are many operators, quantifiers, and other operations that can be used to create increasingly complex patterns. The combinatorial possibilities of the Matcher are frankly enormous, so this should be considered only a basic introduction to what is possible.
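
To give a sense of the syntax before turning to a real text, here are a few pattern shapes as a sketch; the attribute names and OP values come from the spaCy Matcher API, while the specific strings are placeholders.

[{'LOWER': 'res'}]                               # one token whose lowercased text is 'res'
[{'LEMMA': 'res'}, {'POS': 'ADJ'}]               # lemma 'res' followed by any adjective
[{'LEMMA': 'res'}, {'POS': 'ADJ', 'OP': '?'}]    # same, but the adjective is optional
[{'IS_ALPHA': True, 'OP': '+'}]                  # one or more consecutive alphabetic tokens
[{'IS_PUNCT': True, 'OP': '!'}]                  # a token that is not punctuation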

Here we run through a quick example of finding variations of res, and then of pattern-matching extensions such as res publica, using the Praefatio to Livy’s Ab urbe condita.

# Imports & setup

import spacy
from tabulate import tabulate

nlp = spacy.load('la_core_web_lg')

with open('livy_praefatio.txt') as f:
    text = f.read() 

doc = nlp(text)
print(doc[:100])
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec, si sciam, dicere ausim, quippe qui cum veterem tum volgatam esse rem videam, dum novi semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. Utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili parte et ipsum consuluisse; et si in tanta scriptorum turba mea fama in obscuro sit, nobilitate ac magnitudine eorum me qui nomini officient meo consoler. Res est praeterea et immensi operis,

The Matcher is initialized with the Vocab object from our loaded pipeline.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x174721e10>

We use the Matcher by adding patterns, quite sensibly, with the add method. The add method takes a match_id, a “name” for the pattern, along with the patterns themselves. Patterns are lists of lists of dictionaries; the dictionaries are arranged sequentially according to the token sequence we want to match, where the dictionary keys are attributes and the dictionary values are the specific values to be matched for each attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the TEXT attribute matches the string “res” exactly. The Matcher returns the match_id as well as the start and end indices for each matched span.

pattern = [{'TEXT': 'res'}]
matcher.add('res_tokens', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_tokens        8      9  res
res_tokens      424    425  res

# Helper functions

def pattern2matches(pattern_name, pattern):
    """Build a one-pattern Matcher, run it over the Doc, and return the match data."""
    matcher = Matcher(nlp.vocab)
    matcher.add(pattern_name, [pattern])
    matches = matcher(doc)
    matches_data = []
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        matches_data.append((string_id, start, end, span.text))
    return matches_data

def tabulate_matches(pattern_name, pattern):
    """Print the matches for a single pattern as a table."""
    matches_data = pattern2matches(pattern_name, pattern)
    print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))

Note that the matches above on TEXT are case-sensitive. We can widen our search for res by using the LOWER attribute…

pattern = [{'LOWER': 'res'}]
tabulate_matches('res_uncased', pattern)
Match ID       Start    End  Matched text
-----------  -------  -----  --------------
res_uncased        8      9  res
res_uncased       93     94  Res
res_uncased      424    425  res

Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.

pattern = [{'LEMMA': 'res'}]
tabulate_matches('res_lemma', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_lemma         8      9  res
res_lemma        30     31  rem
res_lemma        39     40  rebus
res_lemma        57     58  rerum
res_lemma        93     94  Res
res_lemma       214    215  rerum
res_lemma       380    381  rerum
res_lemma       399    400  rei
res_lemma       424    425  res
res_lemma       463    464  rerum
res_lemma       507    508  rei

So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN'}]
tabulate_matches('res_lemma_noun', pattern)
Match ID          Start    End  Matched text
--------------  -------  -----  --------------
res_lemma_noun        8     10  res populi

By contrast, we can return spans where the first token has the lemma “res” and the following token does not have the POS “NOUN”; the “!” operator negates the token pattern it is attached to.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN', "OP": "!"}]
tabulate_matches('res_lemma_noun_not', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_lemma_noun_not       30     32  rem videam
res_lemma_noun_not       39     41  rebus certius
res_lemma_noun_not       57     59  rerum gestarum
res_lemma_noun_not       93     95  Res est
res_lemma_noun_not      214    216  rerum gestarum
res_lemma_noun_not      380    382  rerum salubre
res_lemma_noun_not      399    401  rei publicae
res_lemma_noun_not      424    426  res publica
res_lemma_noun_not      463    465  rerum minus
res_lemma_noun_not      507    509  rei absint

The Matcher also allows for “fuzzy” matching based on Levenshtein distance (see the spaCy Matcher documentation linked below)…

pattern = [{'LEMMA': 'res'}, {"TEXT": {"FUZZY": "public"}}]
tabulate_matches('res_public_fuzzy', pattern)
Match ID            Start    End  Matched text
----------------  -------  -----  --------------
res_public_fuzzy      399    401  rei publicae
res_public_fuzzy      424    426  res publica
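
The bare FUZZY attribute permits an edit distance that scales with the length of the pattern string; spaCy also provides FUZZY1 through FUZZY9 for pinning the maximum Levenshtein distance explicitly. A minimal sketch (the match_id is illustrative):

pattern = [{'LEMMA': 'res'}, {'TEXT': {'FUZZY1': 'public'}}]
# FUZZY1 caps the edit distance at 1, so this should still match
# 'publica' (one insertion) but no longer 'publicae' (two insertions).
tabulate_matches('res_public_fuzzy1', pattern)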

Of course, you could search for res publica more directly with a two-lemma pattern…

pattern = [{'LEMMA': 'res'}, {'LEMMA': 'publicus'}]
tabulate_matches('res_publica_lemmas', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_publica_lemmas      399    401  rei publicae
res_publica_lemmas      424    426  res publica
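
Both patterns above require the two lemmas to be adjacent. Given the flexibility of Latin word order, it can be worth allowing for an intervening token; here is a sketch using the wildcard-token idiom from the spaCy documentation (an empty token dictionary matches any token, and the pattern name is illustrative):

# Allow zero or one token of any kind between 'res' and 'publicus';
# {'OP': '?'} is a wildcard token marked as optional.
pattern = [{'LEMMA': 'res'}, {'OP': '?'}, {'LEMMA': 'publicus'}]
tabulate_matches('res_publica_flexible', pattern)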

The patterns are not limited to text search; that is, they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…

pattern = [{"POS": "NOUN"}, {"POS": "ADJ"}]
tabulate_matches('noun_adjs', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  -------------------
noun_adjs         9     11  populi Romani
noun_adjs        46     48  arte rudem
noun_adjs       127    129  origines proxima
noun_adjs       207    209  urbem poeticis
noun_adjs       237    239  urbium augustiora
noun_adjs       260    262  populo Romano
noun_adjs       276    278  gentes humanae
noun_adjs       380    382  rerum salubre
noun_adjs       399    401  rei publicae
noun_adjs       407    409  inceptu foedum
noun_adjs       424    426  res publica
noun_adjs       432    434  exemplis ditior
noun_adjs       539    541  successus prosperos
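
Morphological annotations can be queried the same way. Pattern values for MORPH support set comparisons such as IS_SUPERSET; a minimal sketch, assuming the UD-style feature Case=Abl that LatinCy’s morphologizer produces:

# Match nouns whose morphological features include ablative case;
# IS_SUPERSET requires the token's MORPH to contain every listed feature.
pattern = [{'POS': 'NOUN', 'MORPH': {'IS_SUPERSET': ['Case=Abl']}}]
tabulate_matches('ablative_nouns', pattern)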

We can even produce a list of alliterative sequences, though this takes some creative regexing…

matcher = Matcher(nlp.vocab)

# For each letter of the alphabet, add a pattern matching two or more
# consecutive tokens that begin with that letter; all of the patterns
# are registered under a single match_id.
for letter in "abcdefghijklmnopqrstuvwxyz":
    pattern = [{"LOWER": {"REGEX": rf'\b{letter}.+?\b'}, "OP": "{2,}"}]
    matcher.add('alliterative_pairs', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID              Start    End  Matched text
------------------  -------  -----  --------------------------
alliterative_pairs        3      5  sim si
alliterative_pairs       13     15  satis scio
alliterative_pairs       17     19  si sciam
alliterative_pairs       23     25  quippe qui
alliterative_pairs       35     37  semper scriptores
alliterative_pairs       41     43  aliquid allaturos
alliterative_pairs       62     64  populi pro
alliterative_pairs      102    104  supra septingentesimum
alliterative_pairs      142    144  pridem praevalentis
alliterative_pairs      142    145  pridem praevalentis populi
alliterative_pairs      143    145  praevalentis populi
alliterative_pairs      155    157  praemium petam
alliterative_pairs      205    207  conditam condendamve
alliterative_pairs      279    281  aequo animo
alliterative_pairs      291    293  animaduersa aut
alliterative_pairs      293    295  existimata erunt
alliterative_pairs      347    349  magis magis
alliterative_pairs      365    367  nostra nec
alliterative_pairs      368    370  pati possumus
alliterative_pairs      368    371  pati possumus perventum
alliterative_pairs      369    371  possumus perventum
alliterative_pairs      389    391  in inlustri
alliterative_pairs      396    398  tibi tuae
alliterative_pairs      482    484  pereundi perdendi
alliterative_pairs      518    520  deorum dearum
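
Note that the output contains overlapping spans (e.g. 142-144, 142-145, and 143-145). If only the longest non-overlapping matches are wanted, spaCy’s filter_spans utility can be applied to the matched spans; a minimal sketch:

from spacy.util import filter_spans

spans = [doc[start:end] for _, start, end in matches]
# filter_spans keeps the longest spans and drops any that overlap them
filtered = [(s.start, s.end, s.text) for s in filter_spans(spans)]
print(tabulate(filtered, headers=['Start', 'End', 'Matched text']))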

References

spaCy Matcher: https://spacy.io/api/matcher