9  Sequence Matching

Author

Patrick J. Burns

Published

August 26, 2024

9.1 Sequence matching with LatinCy

We can use LatinCy annotations as the basis for matching spans of tokens with spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among others; the full list can be found in the spaCy documentation. Moreover, there are many operators, quantifiers, and other operations that can be used to create increasingly complex patterns. The combinatorial possibilities of the Matcher are frankly enormous, so this should be considered only a basic introduction to what is possible.
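
To give a sense of the syntax before turning to a real text, here are a few pattern shapes as a sketch; the attribute names and OP values come from the spaCy Matcher API, while the specific strings are placeholders.

[{'LOWER': 'res'}]                               # one token whose lowercased text is 'res'
[{'LEMMA': 'res'}, {'POS': 'ADJ'}]               # lemma 'res' followed by any adjective
[{'LEMMA': 'res'}, {'POS': 'ADJ', 'OP': '?'}]    # same, but the adjective is optional
[{'IS_ALPHA': True, 'OP': '+'}]                  # one or more consecutive alphabetic tokens
[{'IS_PUNCT': True, 'OP': '!'}]                  # a token that is not punctuation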

Here we run through a quick example of finding variations of res, and then of pattern-matching extensions such as res publica, using the Praefatio to Livy’s Ab urbe condita.

# Imports & setup

import spacy
from tabulate import tabulate

nlp = spacy.load('la_core_web_lg')

with open('livy_praefatio.txt') as f:
    text = f.read() 

doc = nlp(text)
print(doc[:100])
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec, si sciam, dicere ausim, quippe qui cum veterem tum volgatam esse rem videam, dum novi semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. Utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili parte et ipsum consuluisse; et si in tanta scriptorum turba mea fama in obscuro sit, nobilitate ac magnitudine eorum me qui nomini officient meo consoler. Res est praeterea et immensi operis,

The Matcher is initialized with the Vocab object from our loaded pipeline.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x174721e10>

We use the Matcher by adding patterns, quite sensibly, with the add method. The add method takes a match_id, a “name” for the pattern, along with the patterns themselves. Patterns are lists of lists of dictionaries; the dictionaries are arranged sequentially according to the token sequence we want to match, where the dictionary keys are attributes and the dictionary values are the specific values to be matched for each attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the TEXT attribute matches the string “res” exactly. The Matcher returns the match_id as well as the start and end indices for each matched span.

pattern = [{'TEXT': 'res'}]
matcher.add('res_tokens', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_tokens        8      9  res
res_tokens      424    425  res

# Helper functions

def pattern2matches(pattern_name, pattern):
    """Build a one-pattern Matcher, run it over the Doc, and return the match data."""
    matcher = Matcher(nlp.vocab)
    matcher.add(pattern_name, [pattern])
    matches = matcher(doc)
    matches_data = []
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        matches_data.append((string_id, start, end, span.text))
    return matches_data

def tabulate_matches(pattern_name, pattern):
    """Print the matches for a single pattern as a table."""
    matches_data = pattern2matches(pattern_name, pattern)
    print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))

Note that the matches above on TEXT are case-sensitive. We can widen our search for res by using the LOWER attribute…

pattern = [{'LOWER': 'res'}]
tabulate_matches('res_uncased', pattern)
Match ID       Start    End  Matched text
-----------  -------  -----  --------------
res_uncased        8      9  res
res_uncased       93     94  Res
res_uncased      424    425  res

Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.

pattern = [{'LEMMA': 'res'}]
tabulate_matches('res_lemma', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_lemma         8      9  res
res_lemma        30     31  rem
res_lemma        39     40  rebus
res_lemma        57     58  rerum
res_lemma        93     94  Res
res_lemma       214    215  rerum
res_lemma       380    381  rerum
res_lemma       399    400  rei
res_lemma       424    425  res
res_lemma       463    464  rerum
res_lemma       507    508  rei

So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN'}]
tabulate_matches('res_lemma_noun', pattern)
Match ID          Start    End  Matched text
--------------  -------  -----  --------------
res_lemma_noun        8     10  res populi

By contrast, we can return spans where the first token has the lemma “res” and the following token does not have the POS “NOUN”; the “!” operator negates the token pattern it is attached to.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN', "OP": "!"}]
tabulate_matches('res_lemma_noun_not', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_lemma_noun_not       30     32  rem videam
res_lemma_noun_not       39     41  rebus certius
res_lemma_noun_not       57     59  rerum gestarum
res_lemma_noun_not       93     95  Res est
res_lemma_noun_not      214    216  rerum gestarum
res_lemma_noun_not      380    382  rerum salubre
res_lemma_noun_not      399    401  rei publicae
res_lemma_noun_not      424    426  res publica
res_lemma_noun_not      463    465  rerum minus
res_lemma_noun_not      507    509  rei absint

The Matcher also allows for “fuzzy” matching based on Levenshtein distance (see the spaCy Matcher documentation linked below)…

pattern = [{'LEMMA': 'res'}, {"TEXT": {"FUZZY": "public"}}]
tabulate_matches('res_public_fuzzy', pattern)
Match ID            Start    End  Matched text
----------------  -------  -----  --------------
res_public_fuzzy      399    401  rei publicae
res_public_fuzzy      424    426  res publica
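
The bare FUZZY attribute permits an edit distance that scales with the length of the pattern string; spaCy also provides FUZZY1 through FUZZY9 for pinning the maximum Levenshtein distance explicitly. A minimal sketch (the match_id is illustrative):

pattern = [{'LEMMA': 'res'}, {'TEXT': {'FUZZY1': 'public'}}]
# FUZZY1 caps the edit distance at 1, so this should still match
# 'publica' (one insertion) but no longer 'publicae' (two insertions).
tabulate_matches('res_public_fuzzy1', pattern)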

Of course, you could search for res publica more directly with a two-lemma pattern…

pattern = [{'LEMMA': 'res'}, {'LEMMA': 'publicus'}]
tabulate_matches('res_publica_lemmas', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_publica_lemmas      399    401  rei publicae
res_publica_lemmas      424    426  res publica
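
Both patterns above require the two lemmas to be adjacent. Given the flexibility of Latin word order, it can be worth allowing for an intervening token; here is a sketch using the wildcard-token idiom from the spaCy documentation (an empty token dictionary matches any token, and the pattern name is illustrative):

# Allow zero or one token of any kind between 'res' and 'publicus';
# {'OP': '?'} is a wildcard token marked as optional.
pattern = [{'LEMMA': 'res'}, {'OP': '?'}, {'LEMMA': 'publicus'}]
tabulate_matches('res_publica_flexible', pattern)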

The patterns are not limited to text search; that is, they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…

pattern = [{"POS": "NOUN"}, {"POS": "ADJ"}]
tabulate_matches('noun_adjs', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  -------------------
noun_adjs         9     11  populi Romani
noun_adjs        46     48  arte rudem
noun_adjs       127    129  origines proxima
noun_adjs       207    209  urbem poeticis
noun_adjs       237    239  urbium augustiora
noun_adjs       260    262  populo Romano
noun_adjs       276    278  gentes humanae
noun_adjs       380    382  rerum salubre
noun_adjs       399    401  rei publicae
noun_adjs       407    409  inceptu foedum
noun_adjs       424    426  res publica
noun_adjs       432    434  exemplis ditior
noun_adjs       539    541  successus prosperos
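
Morphological annotations can be queried the same way. Pattern values for MORPH support set comparisons such as IS_SUPERSET; a minimal sketch, assuming the UD-style feature Case=Abl that LatinCy’s morphologizer produces:

# Match nouns whose morphological features include ablative case;
# IS_SUPERSET requires the token's MORPH to contain every listed feature.
pattern = [{'POS': 'NOUN', 'MORPH': {'IS_SUPERSET': ['Case=Abl']}}]
tabulate_matches('ablative_nouns', pattern)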

We can even produce a list of alliterative sequences, though this takes some creative regexing…

matcher = Matcher(nlp.vocab)

# For each letter of the alphabet, add a pattern matching two or more
# consecutive tokens that begin with that letter; all of the patterns
# are registered under a single match_id.
for letter in "abcdefghijklmnopqrstuvwxyz":
    pattern = [{"LOWER": {"REGEX": rf'\b{letter}.+?\b'}, "OP": "{2,}"}]
    matcher.add('alliterative_pairs', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID              Start    End  Matched text
------------------  -------  -----  --------------------------
alliterative_pairs        3      5  sim si
alliterative_pairs       13     15  satis scio
alliterative_pairs       17     19  si sciam
alliterative_pairs       23     25  quippe qui
alliterative_pairs       35     37  semper scriptores
alliterative_pairs       41     43  aliquid allaturos
alliterative_pairs       62     64  populi pro
alliterative_pairs      102    104  supra septingentesimum
alliterative_pairs      142    144  pridem praevalentis
alliterative_pairs      142    145  pridem praevalentis populi
alliterative_pairs      143    145  praevalentis populi
alliterative_pairs      155    157  praemium petam
alliterative_pairs      205    207  conditam condendamve
alliterative_pairs      279    281  aequo animo
alliterative_pairs      291    293  animaduersa aut
alliterative_pairs      293    295  existimata erunt
alliterative_pairs      347    349  magis magis
alliterative_pairs      365    367  nostra nec
alliterative_pairs      368    370  pati possumus
alliterative_pairs      368    371  pati possumus perventum
alliterative_pairs      369    371  possumus perventum
alliterative_pairs      389    391  in inlustri
alliterative_pairs      396    398  tibi tuae
alliterative_pairs      482    484  pereundi perdendi
alliterative_pairs      518    520  deorum dearum
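
Note that the output contains overlapping spans (e.g. 142-144, 142-145, and 143-145). If only the longest non-overlapping matches are wanted, spaCy’s filter_spans utility can be applied to the matched spans; a minimal sketch:

from spacy.util import filter_spans

spans = [doc[start:end] for _, start, end in matches]
# filter_spans keeps the longest spans and drops any that overlap them
filtered = [(s.start, s.end, s.text) for s in filter_spans(spans)]
print(tabulate(filtered, headers=['Start', 'End', 'Matched text']))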

References

spaCy Matcher: https://spacy.io/api/matcher