4  Sentence segmentation

4.1 Sentence segmentation with LatinCy

Sentence segmentation is the task of splitting a text into sentences. For the LatinCy models, this task has been trained using spaCy’s senter factory to terminate sentences at both strong and weak stops, following the example of Clayman (1981) (see also Wake (1957) and Janson (1964)), who writes: “If all stops are made equivalent, i.e. if no distinction is made between the strong stop, weak stop and interrogation mark, editorial differences will be kept to a minimum.”
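LatinCy’s senter is a trained statistical component, but the principle of treating all stops as equivalent can be illustrated with a simple rule-based sketch. The function below is hypothetical and is not how LatinCy segments sentences; it only shows what “all stops made equivalent” means in practice.

```python
import re

def naive_segment(text):
    """Split text at strong stops (.), weak stops (; :), and
    interrogation marks (?), treating all as equivalent terminators.
    A rule-based illustration only; LatinCy uses a trained senter."""
    # Split after any terminator that is followed by whitespace
    parts = re.split(r'(?<=[.;:?])\s+', text.strip())
    return [p for p in parts if p]

segments = naive_segment(
    "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum."
)
# segments == ['Haec narrantur a poetis de Perseo.',
#              'Perseus filius erat Iovis, maximi deorum.']
```

A trained component like senter will handle cases this regex cannot, such as abbreviations or editorial punctuation, which is why the models learn boundaries rather than match them.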

Given a spaCy Doc, the sents attribute will produce a generator object with the sentences from that document as determined by the trained sentence recognizer. Each sentence is a Span object with the start and end token indices from the original Doc.

# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')
text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
sents = doc.sents
print(type(sents))
<class 'generator'>
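Because sents is a generator, it can be consumed only once; a second pass over the same generator yields nothing. A plain-Python illustration of this behavior:

```python
# Any Python generator is exhausted after one full iteration
gen = (n * n for n in range(3))
first_pass = list(gen)   # [0, 1, 4]
second_pass = list(gen)  # [] -- the generator is now exhausted
```

This is why the next cell converts doc.sents to a list before iterating.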

Like all Span objects, the text of each sentence can be retrieved with the text attribute. For convenience below, we convert the generator to a list so that we can iterate over it multiple times. Here are the three sentences identified in the example text, as well as an indication of the sentences’ type, i.e. <class 'spacy.tokens.span.Span'>.

sents = list(sents)

for i, sent in enumerate(sents, 1):
    print(f'{i}: {sent.text}')
1: Haec narrantur a poetis de Perseo.
2: Perseus filius erat Iovis, maximi deorum.
3: Avus eius Acrisius appellabatur.
sent = sents[0]
print(type(sent))
<class 'spacy.tokens.span.Span'>

Sentences have the same attributes/methods available to them as any span (listed in the next cell). Following are some attributes/methods that may be particularly relevant to working with sentences.

# Filter out names containing underscores (e.g. dunder methods)
sent_methods = [item for item in dir(sent) if '_' not in item]
pprint(sent_methods)
['conjuncts',
 'doc',
 'end',
 'ents',
 'id',
 'label',
 'lefts',
 'rights',
 'root',
 'sent',
 'sentiment',
 'sents',
 'similarity',
 'start',
 'subtree',
 'tensor',
 'text',
 'vector',
 'vocab']

You can identify the root of the sentence as determined by the dependency parser. Assuming the parse is correct, this will be the main verb of the sentence.

print(sent.root)
narrantur

Each word in the sentence has an associated vector. Each sentence (in fact, any Span) has an associated vector as well: the mean of the vectors of the words in the sentence. As this example uses the lg model, the vector has a length of 300.

print(sent.vector.shape)
(300,)
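The span vector is simply the element-wise mean of the token vectors. With toy 3-dimensional vectors (hypothetical values; real lg vectors have 300 dimensions) the computation looks like this:

```python
# Toy token vectors standing in for the model's 300-dimensional vectors
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

# Element-wise mean across tokens, as spaCy computes Span.vector
span_vector = [sum(dim) / len(token_vectors) for dim in zip(*token_vectors)]
# span_vector == [2.0, 3.0, 4.0]
```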

This vector can then be used to compute the similarity between two sentences. Here we compare our example sentence to two related sentences: 1. a sentence where the character referred to is changed from Perseus to Ulysses; and 2. an active-voice version of the sentence.

sent.similarity(nlp('Haec narrantur a poetis de Ulixe.'))
0.9814933448498585
sent.similarity(nlp('Haec narrant poetae de Perseo.'))
0.7961655550941479
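spaCy’s similarity score is the cosine of the angle between the two vectors. A minimal stdlib sketch of that computation, using toy 2-dimensional vectors rather than the model’s 300-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # ~1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0
```

This is why the Ulysses variant, which shares most of its words with the original, scores higher than the active-voice rewrite.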

We can retrieve the start and end indices from the original document for each sentence.

sent_2 = sents[1]
start = sent_2.start
end = sent_2.end
print(start)
print(end)
print(sent_2.text)
print(doc[start:end].text)
7
15
Perseus filius erat Iovis, maximi deorum.
Perseus filius erat Iovis, maximi deorum.
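Span start and end are token indices into the original Doc, and the end index is exclusive, so slicing a Doc behaves like slicing a plain Python list of tokens. A simplified analogy using the example text (token list written out by hand, with punctuation split off as spaCy does):

```python
# Tokens of the example text as a plain list
tokens = ["Haec", "narrantur", "a", "poetis", "de", "Perseo", ".",
          "Perseus", "filius", "erat", "Iovis", ",", "maximi", "deorum", "."]

# The second sentence spans token indices 7 through 14 (end is exclusive)
start, end = 7, 15
sentence_tokens = tokens[start:end]
# ['Perseus', 'filius', 'erat', 'Iovis', ',', 'maximi', 'deorum', '.']
```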

References

SLP, Chapter 2, Section 2.4.5, “Sentence Segmentation,” p. 24, https://web.stanford.edu/~jurafsky/slp3/
spaCy SentenceRecognizer