As we saw in the previous notebooks, we can combine annotations from the tagger and the parser to extract relevant spans of text, like noun chunks. Noun chunks are defined at the language level in spaCy (that is, in the “la” module itself) as a syntax iterator. (In fact, the spaCy documentation explicitly notes that the syntax iteration code is “at the moment, only used for noun chunks.”)
But the logic that is used to define a language-level syntax iterator can just as easily be used to write other such “chunk” extractors. In this notebook, we will look at two other kinds of “chunking” that we can do with LatinCy: cum clauses and prepositional phrases.
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = """Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit."""
doc = nlp(text)
print(doc)
Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.
13.1.1 Extracting cum clauses with LatinCy
Writing syntax iterators can be more involved than other kinds of LatinCy text analysis. That said, they are still a good demonstration of how multiple annotations can work together to extract the kinds of syntactical structures that are of particular interest to Latinists, whether in research or in teaching. We will break down the process as follows:
First, let’s write a syntax iterator that will find (optimistically!) all subordinate clauses. (Note that at this stage in LatinCy development, parser performance, even on the best-performing models, is far from perfect. Since clause boundaries are read directly off the dependency parse, clause detection will be imperfect wherever the parse is.)
Second, we will write a function that will only return those subordinate clauses that are introduced by the subordinating conjunction cum. To test for this, we will look for a token anywhere in the clause that has both the lemma “cum” and the POS tag “SCONJ”.
Note some details about the assumptions of this syntax iteration code…
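It assumes that a clause contains exactly one ‘mark’ annotation from the dependency parser, i.e. one word like a relative pronoun or a subordinating conjunction that marks the beginning of the clause. Candidate spans with no ‘mark’ token (typically main clauses) or with more than one are skipped.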
It removes both initial and final punctuation from the clause.
# Function for finding clauses
from spacy.symbols import VERB, PUNCT
from spacy.tokens import Doc

def find_clauses(doclike):
    """Extract clauses from Latin text by finding finite verbs within sentences."""

    def get_clause_tokens(verb_token):
        """Get all tokens in the clause by walking the dependency tree."""
        clause_tokens = set([verb_token])

        def collect_dependents(token):
            for child in token.children:
                # Stop at new clause boundaries
                if ((child.pos == VERB and child.morph.get("VerbForm") == ["Fin"])
                        or child.dep_ == "cc"):
                    continue
                clause_tokens.add(child)
                collect_dependents(child)

        collect_dependents(verb_token)
        return clause_tokens

    doc = doclike.doc
    clause_label = doc.vocab.strings.add("CLAUSE")

    # Iterate through sentences and extract clauses
    for sent in doc.sents:
        finite_verbs = [t for t in sent if t.pos == VERB and t.morph.get("VerbForm") == ["Fin"]]

        for verb in finite_verbs:
            clause_tokens = get_clause_tokens(verb)
            if clause_tokens:
                # Get initial bounds, i.e. find the left and right side of the clause
                left = min(t.i for t in clause_tokens)
                right = max(t.i for t in clause_tokens)

                # Check number of markers in the clause
                span = doc[left:right + 1]
                markers = [t for t in span if t.dep_ == "mark"]
                if len(markers) != 1:  # Only yield if exactly one marker
                    continue

                # Adjust bounds to skip punctuation
                while left < right and doc[left].pos == PUNCT:
                    left += 1
                while right > left and doc[right].pos == PUNCT:
                    right -= 1

                yield left, right + 1, clause_label
# Process text
doc = nlp(text)

# Find and display clauses
print(f'Here are the subordinate clauses in the text "{text}":')
doc.spans["clauses"] = [doc[start:end] for start, end, _ in find_clauses(doc)]
for span in doc.spans["clauses"]:
    print(f"- {span.text}")
    print(f"  Clause verb: {[t for t in span if t.pos == VERB][0].text}")
    print(f'  Introductory word: {[t for t in span if t.dep_ == "mark"][0].lemma_}')
Here are the subordinate clauses in the text "Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.":
- Eo cum venisset
Clause verb: venisset
Introductory word: cum
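The two assumptions noted above (exactly one ‘mark’ token per candidate span, and punctuation trimmed from both ends) can be tried out without loading a model. What follows is only a minimal sketch: it assumes the candidate span is given as plain (text, POS, dep) tuples copied from the sample sentence's annotations, and check_and_trim is a hypothetical helper, not part of spaCy or LatinCy.

```python
# Stand-alone sketch of two checks made inside find_clauses.
# check_and_trim is a hypothetical helper; tokens are (text, pos, dep)
# tuples rather than spaCy Token objects.

def check_and_trim(tokens):
    """Return the trimmed token texts, or None if the span does not
    contain exactly one token with the dependency label 'mark'."""
    markers = [t for t in tokens if t[2] == "mark"]
    if len(markers) != 1:
        return None

    left, right = 0, len(tokens) - 1
    # Move the bounds inward past leading and trailing punctuation
    while left < right and tokens[left][1] == "PUNCT":
        left += 1
    while right > left and tokens[right][1] == "PUNCT":
        right -= 1
    return [t[0] for t in tokens[left:right + 1]]


# Candidate span around the verb "venisset"
subordinate = [
    ("Eo", "PRON", "obl"),
    ("cum", "SCONJ", "mark"),
    ("venisset", "VERB", "advcl"),
    (",", "PUNCT", "punct"),
]

# A stretch of the main clause, which has no 'mark' token
main = [
    ("eum", "PRON", "obj"),
    ("obsecravit", "VERB", "conj"),
    (".", "PUNCT", "punct"),
]

print(check_and_trim(subordinate))  # ['Eo', 'cum', 'venisset']
print(check_and_trim(main))         # None
```

The second call shows why the iterator returns only subordinate clauses: a span without a marker never reaches the trimming step.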
Now, we write a function that extracts all of the subordinate clauses from a Doc object and limits its output to only those that meet our definition of a cum clause.
# Function for finding cum clauses
def find_cum_clauses(doclike):
    """Extract 'cum' clauses from Latin text by finding clauses containing lemma 'cum' as SCONJ."""
    doc = doclike.doc
    cum_label = doc.vocab.strings.add("CUM")

    # Get all clauses first
    all_clauses = list(find_clauses(doc))

    # Filter for clauses containing 'cum' as SCONJ anywhere in the clause
    for start, end, _ in all_clauses:
        span = doc[start:end]

        # Look for 'cum' anywhere in the clause
        has_cum = any(token.lemma_ == "cum" and token.pos_ == "SCONJ" for token in span)

        if has_cum:
            yield start, end, cum_label
# Process text
text = """Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit."""
doc = nlp(text)

# Find and display clauses
print(f'Here are the cum clauses in the text "{text}":')
doc.spans["clauses"] = [doc[start:end] for start, end, _ in find_cum_clauses(doc)]
for span in doc.spans["clauses"]:
    print(f"- {span.text}")
    print(f"  Clause verb: {[t for t in span if t.pos == VERB][0].text}")
    print(f'  Introductory word: {[t for t in span if t.dep_ == "mark"][0].lemma_}')
Here are the cum clauses in the text "Eo cum venisset, ad pedes Iasonis se proiecit et multis cum lacrimis eum obsecravit.":
- Eo cum venisset
Clause verb: venisset
Introductory word: cum
The advantage of using the syntax iterator, assuming that the dependency parser works well enough to correctly mark off clauses, is that it extracts the syntactical structures we are interested in more directly. We might instead have tried a text-analytic approach that started with a search for cum. But cum can be both a marker of subordination (i.e. when it has the POS tag SCONJ) and a preposition (i.e. ADP), so a plain search would return both uses. In the sample sentence, the iterator never even considers the prepositional use of cum (in multis cum lacrimis).
# List annotations for each token
doc = nlp(text)

data = []
for token in doc:
    data.append([
        token.text,
        token.lemma_,
        token.pos_,
        token.dep_,
    ])

from tabulate import tabulate
print(tabulate(data, headers=['Token', 'Lemma', 'POS', "Dep"]))
Token       Lemma    POS    Dep
----------  -------  -----  -----
Eo          eo       PRON   obl
cum         cum      SCONJ  mark
venisset    uenio    VERB   advcl
,           ,        PUNCT  punct
ad          ad       ADP    case
pedes       pes      NOUN   obl
Iasonis     Iason    PROPN  nmod
se          sui      PRON   obj
proiecit    proicio  VERB   ROOT
et          et       CCONJ  cc
multis      multus   DET    obl
cum         cum      ADP    case
lacrimis    lacrima  NOUN   obl
eum         is       PRON   obj
obsecravit  obsecro  VERB   conj
.           .        PUNCT  punct
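The table makes the contrast with a plain search concrete. As a minimal sketch, assuming the rows above are copied by hand into (token, lemma, POS, dep) tuples, a lemma-only search finds both instances of cum, while adding the POS test used in find_cum_clauses keeps only the subordinating conjunction. The variable names are illustrative only.

```python
# Rows copied from the annotation table: (token, lemma, pos, dep)
rows = [
    ("Eo", "eo", "PRON", "obl"),
    ("cum", "cum", "SCONJ", "mark"),
    ("venisset", "uenio", "VERB", "advcl"),
    (",", ",", "PUNCT", "punct"),
    ("ad", "ad", "ADP", "case"),
    ("pedes", "pes", "NOUN", "obl"),
    ("Iasonis", "Iason", "PROPN", "nmod"),
    ("se", "sui", "PRON", "obj"),
    ("proiecit", "proicio", "VERB", "ROOT"),
    ("et", "et", "CCONJ", "cc"),
    ("multis", "multus", "DET", "obl"),
    ("cum", "cum", "ADP", "case"),
    ("lacrimis", "lacrima", "NOUN", "obl"),
    ("eum", "is", "PRON", "obj"),
    ("obsecravit", "obsecro", "VERB", "conj"),
    (".", ".", "PUNCT", "punct"),
]

# A lemma-only search returns the positions of both uses of cum
lemma_hits = [i for i, r in enumerate(rows) if r[1] == "cum"]

# Adding the POS test keeps only the subordinating conjunction
sconj_hits = [i for i, r in enumerate(rows) if r[1] == "cum" and r[2] == "SCONJ"]

print(lemma_hits)  # [1, 11]
print(sconj_hits)  # [1]
```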
13.1.2 Extracting prepositional phrases with LatinCy
As we have already seen with noun chunking, there are syntactic structures that we may want to extract that are not at the clause level. Take, for example, prepositional phrases. In a sense, these phrases are the prepositional equivalent of noun chunks. Our job is to identify the preposition and all of the words that belong with it in the dependency parse. In the annotations above, a preposition carries the ‘case’ relation and attaches to the noun it governs, so the noun, not the preposition, is the head. We can therefore start by finding words tagged as prepositions, move up to the head noun, and then collect that noun's direct children.
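Before the spaCy version, the walk can be sketched on plain Python data. This is a minimal illustration under stated assumptions: Tok is a hypothetical stand-in for a spaCy Token, phrase_bounds is a hypothetical helper, and the POS tags, dependency labels and head indices below are assigned by hand for illustration rather than taken from LatinCy output.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy Token."""
    text: str
    pos: str
    dep: str
    head: int  # index of the head token; the root points to itself

# Hand-assigned annotations for illustration (not model output)
sent = [
    Tok("Perseus", "PROPN", "nsubj", 5),
    Tok("autem", "ADV", "advmod", 5),
    Tok("in", "ADP", "case", 3),
    Tok("sinu", "NOUN", "obl", 5),
    Tok("matris", "NOUN", "nmod", 3),
    Tok("dormiebat", "VERB", "ROOT", 5),
    Tok(".", "PUNCT", "punct", 5),
]

def phrase_bounds(tokens, i):
    """Return (start, end) token indices, end exclusive, for the
    preposition at index i, or None if token i is not an ADP with
    the dependency label 'case'."""
    tok = tokens[i]
    if tok.pos != "ADP" or tok.dep != "case":
        return None
    members = {i}
    head = tok.head
    if tokens[head].pos in ("NOUN", "PROPN", "PRON"):
        members.add(head)
        # Add the direct children of the head noun (no recursion)
        members.update(j for j, t in enumerate(tokens)
                       if t.head == head and j != head)
    return min(members), max(members) + 1

start, end = phrase_bounds(sent, 2)
print(" ".join(t.text for t in sent[start:end]))  # in sinu matris
```

Because only direct children of the noun are collected, a modifier of a modifier would be left out of the phrase; whether that matters depends on the texts being analyzed.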
# Syntax iterator code for prepositional phrases
from typing import Iterator, Tuple, Union, Set
from spacy.errors import Errors  # needed for Errors.E029 below
from spacy.symbols import ADP, PUNCT
from spacy.tokens import Doc, Span

def prepositional_phrases(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
    """Find prepositional phrases by identifying prepositions and their objects."""

    def get_phrase_tokens(prep_token) -> Set:
        """Get all tokens in the prepositional phrase by walking the dependency tree."""
        phrase_tokens = set([prep_token])

        # Get the noun that this preposition is attached to (its head)
        if prep_token.dep_ == "case" and prep_token.head.pos_ in ["NOUN", "PROPN", "PRON"]:
            noun = prep_token.head
            phrase_tokens.add(noun)

            # Get all direct dependents of the noun
            for child in noun.children:
                if child != prep_token:  # avoid adding preposition twice
                    phrase_tokens.add(child)

        return phrase_tokens

    doc = doclike.doc

    if not doc.has_annotation("DEP"):
        raise ValueError(Errors.E029)

    if not len(doc):
        return

    pp_label = doc.vocab.strings.add("PP")
    prev_right = -1

    for token in doclike:
        if token.pos == ADP and token.dep_ == "case":
            phrase_tokens = get_phrase_tokens(token)
            if phrase_tokens:
                left = min(t.i for t in phrase_tokens)
                right = max(t.i for t in phrase_tokens)

                # Skip if overlaps with previous phrase
                if left <= prev_right:
                    continue

                # Adjust bounds to skip punctuation
                while left < right and doc[left].pos == PUNCT:
                    left += 1
                while right > left and doc[right].pos == PUNCT:
                    right -= 1

                yield left, right + 1, pp_label
                prev_right = right
# Helper function for displaying prepositional phrases
def display_phrases_markdown(text: str, doc: Doc) -> str:
    """Display text with prepositional phrases marked in bold."""
    # Get prepositional phrases
    spans = [doc[start:end] for start, end, _ in prepositional_phrases(doc)]

    # Sort spans by start position in reverse order
    # (to avoid index confusion when inserting markers)
    spans.sort(key=lambda x: x.start, reverse=True)

    # Insert markdown markers
    marked_text = text
    for span in spans:
        marked_text = (
            marked_text[:span.start_char] +
            f"**{span.text}**" +
            marked_text[span.end_char:]
        )

    return marked_text

# Test with passage from RFF
test_texts = [
    "Haec narrantur a poetis de Perseo.",
    "Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat.",
    "Comprehendit igitur Perseum adhuc infantem, et cum matre in arca lignea inclusit.",
    "Tum arcam ipsam in mare coniecit.",
    "Perseus autem in sinu matris dormiebat."
]

# Display results
print("Prepositional phrases marked with asterisks...")
for text in test_texts:
    doc = nlp(text)
    marked = display_phrases_markdown(text, doc)
    print(f"- {marked}")
Prepositional phrases marked with asterisks...
- Haec narrantur **a poetis** **de Perseo**.
- Acrisius volebat Perseum nepotem suum necare; nam **propter oraculum** puerum timebat.
- Comprehendit igitur Perseum adhuc infantem, et **cum matre** **in arca lignea** inclusit.
- Tum arcam ipsam **in mare** coniecit.
- Perseus autem **in sinu matris** dormiebat.
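The display helper inserts its markers starting from the last span and working backwards. A minimal sketch of why, using plain string offsets instead of spaCy spans: each insertion lengthens the string, so offsets to the right of an insertion point go stale, while offsets to the left stay valid. The function mark and its reverse parameter are hypothetical, and the offsets are computed here with str.find rather than read from span.start_char and span.end_char.

```python
text = "Haec narrantur a poetis de Perseo."

# Character offsets of the two phrases, end exclusive
phrases = ["a poetis", "de Perseo"]
offsets = [(text.find(p), text.find(p) + len(p)) for p in phrases]

def mark(text, offsets, reverse=True):
    """Wrap each (start, end) slice of text in ** markers.

    With reverse=True the rightmost slice is handled first, so the
    offsets of the slices still to be processed remain correct."""
    marked = text
    for start, end in sorted(offsets, reverse=reverse):
        marked = marked[:start] + "**" + marked[start:end] + "**" + marked[end:]
    return marked

print(mark(text, offsets))
# Haec narrantur **a poetis** **de Perseo**.

# Left to right, the second pair of offsets no longer points at "de Perseo"
print(mark(text, offsets, reverse=False) == mark(text, offsets))  # False
```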
Cum clauses and prepositional phrases are not the only types of syntactical structure that Latinists are interested in. I invite readers to think about what the formal definitions with respect to LatinCy annotations would be for other kinds of structures, e.g. ablative absolutes, indirect statement, purpose clauses, etc., and to see whether they can be productively extracted using this kind of syntax iteration.