Sentence segmentation is the task of splitting a text into sentences. For the LatinCy models, this task has been trained using spaCy’s senter factory to terminate sentences at both strong and weak stops, following the example of Clayman (1981) (see also Wake (1957) and Janson (1964)), who writes: “If all stops are made equivalent, i.e. if no distinction is made between the strong stop, weak stop and interrogation mark, editorial differences will be kept to a minimum.”
Given a spaCy Doc, the sents attribute will produce a generator object with the sentences from that document as determined by the trained sentence segmenter. Each sentence is a Span object with the start and end token indices from the original Doc.
# Imports & setupimport spacyfrom pprint import pprintnlp = spacy.load('la_core_web_lg')text ="Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."doc = nlp(text)print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
sents = doc.sents
print(type(sents))
<class 'generator'>
Like all Span objects, the text of each sentence can be retrieved with the text attribute. For convenience below, we convert the generator to a list so that we can iterate over it multiple times. Here are the three (3) sentences identified in the example text, as well as an indication of the sentences’ type, i.e. <class 'spacy.tokens.span.Span'>.
sents = list(sents)
for i, sent in enumerate(sents, 1):
    print(f'{i}: {sent.text}')
1: Haec narrantur a poetis de Perseo.
2: Perseus filius erat Iovis, maximi deorum.
3: Avus eius Acrisius appellabatur.
sent = sents[0]
print(type(sent))
<class 'spacy.tokens.span.Span'>
Sentences have the same attributes/methods available to them as any span (listed in the next cell). Following are some attributes/methods that may be particularly relevant to working with sentences.
sent_methods = [item for item in dir(sent) if '_' not in item]
pprint(sent_methods)
You can identify the root of the sentence as determined by the dependency parser. Assuming the parse is correct, this will be the main verb of the sentence.
print(sent.root)
narrantur
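As a quick check on that assumption, we can inspect the root token’s part of speech and lemma; this is a minimal sketch, and the exact values depend on the model’s parse.

# Sketch: inspect the root token; for this sentence we would expect a 'VERB'
print(sent.root.pos_, sent.root.lemma_)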
Each word in the sentence has an associated vector. The sentence (in fact, any Span) has an associated vector as well, which is the mean of the vectors of the words in the sentence. As this example uses the lg model, the vector has a length of 300.
print(sent.vector.shape)
(300,)
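To see that the Span vector is indeed the mean of the token vectors, here is a minimal sketch (assuming NumPy is installed alongside spaCy) that recomputes the average directly:

import numpy as np

# Recompute the sentence vector as the mean of the token vectors
token_mean = np.mean([token.vector for token in sent], axis=0)

# Should print True for pipelines with static word vectors like la_core_web_lg
print(np.allclose(sent.vector, token_mean))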
This vector can then be used to compute the similarity between two sentences. Here we compare our example sentence to two related sentences: 1. a sentence where the character referred to is changed from Perseus to Ulysses; and 2. an active-verb version of the sentence.
sent.similarity(nlp('Haec narrantur a poetis de Ulixe.'))
0.9814933448498585
sent.similarity(nlp('Haec narrant poetae de Perseo.'))
0.7961655550941479
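Under the hood, this similarity score is the cosine similarity of the two vectors. The following sketch (again assuming NumPy) reproduces the first comparison by hand:

import numpy as np

other = nlp('Haec narrantur a poetis de Ulixe.')

# Cosine similarity between the sentence vector and the other document's vector
cosine = np.dot(sent.vector, other.vector) / (np.linalg.norm(sent.vector) * np.linalg.norm(other.vector))
print(cosine)  # should be close to the value returned by sent.similarity(...)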
We can retrieve the start and end indices from the original document for each sentence.
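For example, Span.start and Span.end give the token offsets of each sentence within the Doc, and Span.start_char and Span.end_char give the corresponding character offsets:

for i, sent in enumerate(sents, 1):
    print(f'{i}: tokens {sent.start}-{sent.end}, characters {sent.start_char}-{sent.end_char}')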
Wake, William C. 1957. “Sentence-Length Distributions of Greek Authors.” Journal of the Royal Statistical Society. Series A (General) 120 (3): 331–46. https://www.jstor.org/stable/2343104.