13  Syntax iterators

13.1 Other kinds of “chunking” with LatinCy

As we saw in the previous notebooks, we can use the combination of annotations from the tagger and the parser to extract relevant spans of text, like noun chunks. Noun chunks are defined at the language level in spaCy—that is, in the “la” module itself—as a syntax iterator. (In fact, the spaCy documentation explicitly notes that the syntax iteration code is “at the moment, only used for noun chunks”).

But the logic that is used to define a language-level syntax iterator can just as easily be used to write other such “chunk” extractors. In this notebook, we will look at two other kinds of “chunking” that we can do with LatinCy: cum clauses and prepositional phrases.

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')

text = """Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit."""

doc = nlp(text)
print(doc)
Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.
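
As a quick reminder of the built-in behavior described above, the noun-chunk syntax iterator is what populates the standard doc.noun_chunks attribute; here is a minimal look at it on the sentence we just parsed:

# Quick look at the built-in noun-chunk syntax iterator

for chunk in doc.noun_chunks:
    # Each chunk is a Span; its root token carries the dependency label
    print(chunk.text, '|', chunk.root.text, '|', chunk.root.dep_)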

13.1.1 Extracting cum clauses with LatinCy

Writing syntax iterators can be more involved than other kinds of LatinCy text analysis. That said, it is still a good demonstration of how multiple annotations can work together to extract the kinds of syntactical structures that are of particular interest to Latinists, whether in research or in teaching. We will break down the process as follows:

  1. First, let’s write a syntax iterator that will find (optimistically!) all subordinate clauses. (Note that at this stage in LatinCy development, parser performance, even on the best performing models, is far from perfect. Since determining where subordinate clauses begin and end corresponds directly to parser performance, by extension clause detection performance may be similarly imperfect.)
  2. Second, we will write a function that will only return those subordinate clauses that are introduced by the subordinating conjunction cum. To test for this, we will look for a token anywhere in the clause that has both the lemma “cum” and the POS tag “SCONJ”.

Note some details about the assumptions of this syntax iteration code…

  • It assumes that a clause only contains one ‘mark’ annotation from the dependency parser, i.e. one word like a relative pronoun or a subordinating conjunction that marks the beginning of the clause.
  • It removes both initial and final punctuation from the clause.
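
Before writing the full iterator, it is worth confirming the individual annotations that the steps above rely on. Here is a quick, illustrative check of the finite verbs and the "mark" tokens in our sample sentence:

# Quick check of the annotations used by the clause iterator

from spacy.symbols import VERB

# Finite verbs are our candidate clause heads
print([t.text for t in doc if t.pos == VERB and t.morph.get("VerbForm") == ["Fin"]])

# 'mark' dependents are the words that introduce subordinate clauses
print([t.text for t in doc if t.dep_ == "mark"])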
# Function for finding clauses

from spacy.symbols import VERB, PUNCT
from spacy.tokens import Doc

def find_clauses(doclike):
    """Extract clauses from Latin text by finding finite verbs within sentences."""
        
    def get_clause_tokens(verb_token):
        """Get all tokens in the clause by walking the dependency tree."""
        clause_tokens = set([verb_token])
        
        def collect_dependents(token):
            for child in token.children:
                # Stop at new clause boundaries
                if ((child.pos == VERB and child.morph.get("VerbForm") == ["Fin"]) or
                    child.dep_ == "cc"):
                    continue
                clause_tokens.add(child)
                collect_dependents(child)
                
        collect_dependents(verb_token)
        return clause_tokens

    doc = doclike.doc
    
    clause_label = doc.vocab.strings.add("CLAUSE")
    
    # Iterate through sentences and extract clauses
    for sent in doc.sents:
        finite_verbs = [t for t in sent 
                       if t.pos == VERB and t.morph.get("VerbForm") == ["Fin"]]
        
        for verb in finite_verbs:
            clause_tokens = get_clause_tokens(verb)
            if clause_tokens:
                # Get initial bounds, i.e. find the left and right side of the clause
                left = min(t.i for t in clause_tokens)
                right = max(t.i for t in clause_tokens)
                                
                # Check number of markers in the clause
                span = doc[left:right + 1]
                markers = [t for t in span if t.dep_ == "mark"]
                if len(markers) != 1:  # Only yield if exactly one marker
                    continue
                
                # Adjust bounds to skip punctuation
                while left < right and doc[left].pos == PUNCT:
                    left += 1
                while right > left and doc[right].pos == PUNCT:
                    right -= 1
                    
                yield left, right + 1, clause_label
# Process text
doc = nlp(text)

# Find and display clauses
print(f'Here are the subordinate clauses in the text "{text}":')

doc.spans["clauses"] = [doc[start:end] for start, end, _ in find_clauses(doc)]
for span in doc.spans["clauses"]:
    print(f"- {span.text}")
    print(f"  Clause verb: {[t for t in span if t.pos == VERB][0].text}")
    print(f'  Introductory word: {[t for t in span if t.dep_ == "mark"][0].lemma_}')    
Here are the subordinate clauses in the text "Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.":
- Eo cum venisset
  Clause verb: venisset
  Introductory word: cum
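
Since the clause spans are stored under doc.spans["clauses"], they can also be visualized with displaCy's span style; a minimal sketch (in a notebook this renders inline):

# Optional: visualize the clause spans with displaCy

from spacy import displacy

# style="span" reads from doc.spans; point it at our "clauses" key
displacy.render(doc, style="span", options={"spans_key": "clauses"})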

Now, we write a function that extracts all of the subordinate clauses from a Doc object and limits its output to only those that meet our definition of a cum clause.

# Function for finding cum clauses

def find_cum_clauses(doclike):
    """Extract 'cum' clauses from Latin text by finding clauses containing lemma 'cum' as SCONJ."""
    
    doc = doclike.doc
    cum_label = doc.vocab.strings.add("CUM")
    
    # Get all clauses first
    all_clauses = list(find_clauses(doc))
    
    # Filter for clauses containing 'cum' as SCONJ anywhere in the clause
    for start, end, _ in all_clauses:
        span = doc[start:end]
        # Look for 'cum' anywhere in the clause
        has_cum = any(token.lemma_ == "cum" and token.pos_ == "SCONJ" for token in span) 
                     
        if has_cum:
            yield start, end, cum_label
# Process text
text = """Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit."""

doc = nlp(text)

# Find and display clauses
print(f'Here are the cum clauses in the text "{text}":')

doc.spans["clauses"] = [doc[start:end] for start, end, _ in find_cum_clauses(doc)]
for span in doc.spans["clauses"]:
    print(f"- {span.text}")
    print(f"  Clause verb: {[t for t in span if t.pos == VERB][0].text}")
    print(f'  Introductory word: {[t for t in span if t.dep_ == "mark"][0].lemma_}')
Here are the cum clauses in the text "Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.":
- Eo cum venisset
  Clause verb: venisset
  Introductory word: cum

The advantage of using the syntax iterator (assuming that the dependency parser works well enough to correctly mark off clauses) is that it extracts more directly the syntactical structures that we are interested in. We might instead have tried a text-analytical approach to this problem that started with a search for cum. Of course, cum can be both a marker of subordination (i.e. when it has the POS tag SCONJ) and a preposition (i.e. ADP). In the sample sentence, the iterator never even considers the prepositional use of cum (in multis cum lacrimis).
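
For comparison, here is roughly what that naive token-level search would look like; it turns up both instances of cum and leaves the disambiguation to us:

# Naive search for "cum", for comparison with the syntax iterator

doc = nlp(text)

for token in doc:
    if token.lemma_ == "cum":
        # Both the subordinating (SCONJ) and prepositional (ADP) uses show up here
        print(token.text, token.i, token.pos_, token.dep_)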

# List annotations for each token

from tabulate import tabulate

doc = nlp(text)

data = []

for token in doc:
    data.append([
        token.text,
        token.lemma_,
        token.pos_,
        token.dep_,
    ])

print(tabulate(data, headers=['Token', 'Lemma', 'POS', 'Dep']))
Token       Lemma    POS    Dep
----------  -------  -----  -----
Eo          eo       PRON   obl
cum         cum      SCONJ  mark
venisset    uenio    VERB   advcl
,           ,        PUNCT  punct
ad          ad       ADP    case
pedes       pes      NOUN   obl
Iasonis     Iason    PROPN  nmod
se          sui      PRON   obj
proiecit    proicio  VERB   ROOT
et          et       CCONJ  cc
multis      multus   DET    obl
cum         cum      ADP    case
lacrimis    lacrima  NOUN   obl
eum         is       PRON   obj
obsecravit  obsecro  VERB   conj
.           .        PUNCT  punct

13.1.2 Extracting prepositional phrases with LatinCy

As we have already seen with noun chunking, there are syntactic structures that we may want to extract that are not at the clause level. Take, for example, prepositional phrases. In a sense, these phrases are the prepositional equivalent of noun chunks. Our job is to identify the preposition and all of the words that belong to its phrase (from a dependency-parsing perspective). Note that in the dependency scheme used here the preposition is attached to its noun as a "case" dependent (as in the table above), so we start by finding words tagged as prepositions (ADP), take the noun that each one attaches to, and then collect that noun's dependents.

# Syntax iterator code for prepositional phrases

from typing import Iterator, Tuple, Union, Set
from spacy.errors import Errors
from spacy.symbols import ADP, PUNCT
from spacy.tokens import Doc, Span

def prepositional_phrases(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
    """Find prepositional phrases by identifying prepositions and their objects."""
    
    def get_phrase_tokens(prep_token) -> Set:
        """Get all tokens in the prepositional phrase by walking the dependency tree."""
        phrase_tokens = set([prep_token])
        
        # Get the noun that this preposition modifies (its head)
        if prep_token.dep_ == "case" and prep_token.head.pos_ in ["NOUN", "PROPN", "PRON"]:
            noun = prep_token.head
            phrase_tokens.add(noun)
            
            # Get all dependents of the noun
            for child in noun.children:
                if child != prep_token:  # avoid adding preposition twice
                    phrase_tokens.add(child)
                    
        return phrase_tokens

    doc = doclike.doc
    
    if not doc.has_annotation("DEP"):
        raise ValueError(Errors.E029)

    if not len(doc):
        return

    pp_label = doc.vocab.strings.add("PP")
    prev_right = -1
    
    for token in doclike:
        if token.pos == ADP and token.dep_ == "case":
            phrase_tokens = get_phrase_tokens(token)
            if phrase_tokens:
                left = min(t.i for t in phrase_tokens)
                right = max(t.i for t in phrase_tokens)
                
                # Skip if overlaps with previous phrase
                if left <= prev_right:
                    continue
                    
                # Adjust bounds to skip punctuation
                while left < right and doc[left].pos == PUNCT:
                    left += 1
                while right > left and doc[right].pos == PUNCT:
                    right -= 1
                
                yield left, right + 1, pp_label
                prev_right = right
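
One convenient way to make this iterator feel like a built-in is to register it as a custom extension attribute on Doc; a minimal sketch (the attribute name "pps" is arbitrary, not part of spaCy or LatinCy):

# Optional: expose the iterator as a custom Doc extension attribute

from spacy.tokens import Doc

# force=True lets this cell be re-run without raising an error
Doc.set_extension(
    "pps",
    getter=lambda doc: [doc[start:end] for start, end, _ in prepositional_phrases(doc)],
    force=True,
)

# After registration, doc._.pps returns the prepositional-phrase Spans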
# Helper function for displaying prepositional phrases

def display_phrases_markdown(text: str, doc: Doc) -> str:
    """Display text with prepositional phrases marked in bold."""
    # Get prepositional phrases
    spans = [doc[start:end] for start, end, _ in prepositional_phrases(doc)]
    
    # Sort spans by start position in reverse order
    # (to avoid index confusion when inserting markers)
    spans.sort(key=lambda x: x.start, reverse=True)
    
    # Insert markdown markers
    marked_text = text
    for span in spans:
        marked_text = (
            marked_text[:span.start_char] + 
            f"**{span.text}**" + 
            marked_text[span.end_char:]
        )
    
    return marked_text

# Test with passage from RFF

test_texts = [
    "Haec narrantur a poetis de Perseo.",
    "Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat.",
    "Comprehendit igitur Perseum adhuc infantem, et cum matre in arca lignea inclusit.",
    "Tum arcam ipsam in mare coniecit.",
    "Perseus autem in sinu matris dormiebat."
]

# Display results
print("Prepositional phrases marked with asterisks...")
for text in test_texts:
    doc = nlp(text)
    marked = display_phrases_markdown(text, doc)
    print(f"- {marked}")
Prepositional phrases marked with asterisks...
- Haec narrantur **a poetis** **de Perseo**.
- Acrisius volebat Perseum nepotem suum necare; nam **propter oraculum** puerum timebat.
- Comprehendit igitur Perseum adhuc infantem, et **cum matre** **in arca lignea** inclusit.
- Tum arcam ipsam **in mare** coniecit.
- Perseus autem **in sinu matris** dormiebat.

Cum clauses and prepositional phrases are not the only types of syntactical structure that Latinists are interested in. I invite readers to think about what the formal definitions, with respect to LatinCy annotations, would be for other kinds of structures, e.g. ablative absolutes, indirect statement, purpose clauses, etc., and to see whether they can be productively extracted using this kind of syntax iteration.
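
As a starting point, here is one possible sketch of an ablative-absolute finder. It assumes (and this is an assumption about parser behavior, not a guarantee) that the construction is headed by an ablative participle attached as advcl, with an ablative subject among its dependents; any such iterator would need to be tested against real parser output.

# Sketch: a possible ablative-absolute finder (assumptions noted in comments)

def ablative_absolutes(doclike):
    """Yield (start, end, label) for candidate ablative absolutes.

    Assumes the parser attaches the construction as an advcl headed by an
    ablative participle with an ablative subject; check against real output.
    """
    doc = doclike.doc
    label = doc.vocab.strings.add("ABL_ABS")
    for token in doclike:
        is_participle = token.morph.get("VerbForm") == ["Part"]
        is_ablative = "Abl" in token.morph.get("Case")
        if is_participle and is_ablative and token.dep_ == "advcl":
            # Require an ablative subject among the participle's dependents
            subjects = [c for c in token.children
                        if c.dep_ in ("nsubj", "nsubj:pass")
                        and "Abl" in c.morph.get("Case")]
            if subjects:
                span_tokens = list(token.subtree)  # subtree includes the participle
                left = min(t.i for t in span_tokens)
                right = max(t.i for t in span_tokens)
                yield left, right + 1, label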