Named entity recognition is the task of mapping a token (or span of tokens) in a text to a kind of categorical tag, e.g. mapping the token Roma to a tag indicating location. The default LatinCy pipelines have a ner component that assigns the following tags:
PERSON for the names of people
LOC for the names of locations
NORP for the names of groups of people
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

# cf. Caesar, BC 3.1
text = "Dictatore habente comitia Caesare consules creantur Iulius Caesar et P. Servilius."
doc = nlp(text)
print(doc)
Dictatore habente comitia Caesare consules creantur Iulius Caesar et P. Servilius.
import tabulate

# Collect each token's text alongside its entity type (empty string if none)
data = []
tokens = [item for item in doc]
for token in tokens:
    data.append([token.text, token.ent_type_])

print(tabulate.tabulate(data, headers=['Text', "Entity Type"]))
Text       Entity Type
---------  -------------
Dictatore
habente
comitia
Caesare    PERSON
consules
creantur
Iulius     PERSON
Caesar     PERSON
et
P.         PERSON
Servilius  PERSON
.
Entities are annotated on the Token objects and are stored in the ent_type_ attribute.
Moreover, the ner component should be able to recover the entity spans from the text using IOB (i.e. inside/outside/beginning) annotations. The IOB annotations are stored in the ent_iob_ attribute of a spaCy Token object.
In the example below, the PERSON named entity is “Iulius Caesar”. The word “et” is not part of the entity. Accordingly, “Iulius” is tagged with B-PERSON (beginning of the entity), “Caesar” is tagged with I-PERSON (inside the entity, and in this case the last word in the entity), and “et” is tagged with O (outside the entity, which closes off the preceding entity).
Sample token: Iulius
Sample entity type: PERSON
Sample entity IOB: B
Sample entity IOB-type: B-PERSON
Sample token: Caesar
Sample entity type: PERSON
Sample entity IOB: I
Sample entity IOB-type: I-PERSON
Sample token: et
Sample entity type:
Sample entity IOB: O
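To make the IOB logic concrete, here is a minimal sketch of how entity spans can be rebuilt from per-token tags. It uses plain Python and hand-copied values rather than a live pipeline: the list tagged and the helper iob_to_spans are hypothetical names introduced for illustration, and the B tag on “Caesare” is inferred from the fact that it is a one-token entity. In practice spaCy already exposes the grouped spans as doc.ents, so this is only meant to show what the B/I/O tags encode.

```python
# Hand-copied (text, IOB, entity type) triples for the first part of the
# sentence above; in a live pipeline these would come from token.text,
# token.ent_iob_ and token.ent_type_.
tagged = [
    ("Dictatore", "O", ""),
    ("habente", "O", ""),
    ("comitia", "O", ""),
    ("Caesare", "B", "PERSON"),
    ("consules", "O", ""),
    ("creantur", "O", ""),
    ("Iulius", "B", "PERSON"),
    ("Caesar", "I", "PERSON"),
    ("et", "O", ""),
]


def iob_to_spans(tagged):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, label in tagged:
        if iob == "B":
            # A new entity begins; flush any entity still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" (or a stray "I") closes off any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


spans = iob_to_spans(tagged)
print(spans)
# [('Caesare', 'PERSON'), ('Iulius Caesar', 'PERSON')]
```

Note that “et” contributes nothing to the output: its O tag only signals that the entity opened by “Iulius” has ended.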
LatinCy does not (yet) offer a spaCy EntityLinker, that is, a model-based component for resolving named entities to specific referents in a knowledge base. The Wikidata identifier for Julius Caesar is Q1048 [https://www.wikidata.org/wiki/Q1048], and an EntityLinker (if it existed!) could annotate the entity span with this identifier. For now, it will suffice to note that spaCy Token objects have an attribute set aside for this purpose: ent_kb_id_.
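As a rough illustration of where such an identifier would live, the sketch below attaches Q1048 to a span by hand. It assumes spaCy v3 and builds a small Doc manually from an empty Vocab, so it does not use (or require) a LatinCy model. The manual annotation stands in for what a trained EntityLinker would do and is not a LatinCy feature.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Build a Doc by hand so the sketch runs without a trained pipeline.
words = ["creantur", "Iulius", "Caesar", "et"]
doc = Doc(Vocab(), words=words)

# Manually label tokens 1-2 as a PERSON and attach the Wikidata identifier;
# a trained EntityLinker would assign kb_id from its knowledge base instead.
caesar = Span(doc, 1, 3, label="PERSON", kb_id="Q1048")
doc.ents = [caesar]

# The identifier is now readable from each token in the span.
for token in doc:
    print(token.text, token.ent_type_, token.ent_kb_id_)

# It is also available on the entity span itself.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```

Tokens outside the span keep an empty string in ent_kb_id_, just as they do in ent_type_.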