9  Named Entity Recognition

9.1 Named Entity Recognition with LatinCy

Named entity recognition (NER) is the task of assigning a categorical tag to a token (or span of tokens) in a text, e.g. tagging the token Roma as a location. The default LatinCy pipelines have a ner component that assigns the following tags:

  • PERSON for the names of people
  • LOC for the names of locations
  • NORP for the names of groups of people
# Imports & setup

import spacy
from pprint import pprint
nlp = spacy.load('la_core_web_lg')
# cf. Caesar, BC 3.1
text = "Dictatore habente comitia Caesare consules creantur Iulius Caesar et P. Servilius."
doc = nlp(text)
print(doc)
Dictatore habente comitia Caesare consules creantur Iulius Caesar et P. Servilius.
import tabulate

# Collect each token's text and entity type (empty string if untagged)
data = [[token.text, token.ent_type_] for token in doc]

print(tabulate.tabulate(data, headers=['Text', 'Entity Type']))
Text       Entity Type
---------  -------------
Dictatore
habente
comitia
Caesare    PERSON
consules
creantur
Iulius     PERSON
Caesar     PERSON
et
P.         PERSON
Servilius  PERSON
.

Entities are annotated on the Token objects, with the tag stored in the ent_type_ attribute. Tokens that are not part of an entity have an empty string there.

sample_token = doc[7]

print(f'Sample token: {sample_token.text}')
print(f'Sample entity type: {sample_token.ent_type_}')
Sample token: Caesar
Sample entity type: PERSON
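
Since untagged tokens carry an empty ent_type_, picking out the tagged tokens comes down to a truthiness check. The following is a minimal sketch in plain Python, not LatinCy code: the (text, entity type) pairs are copied by hand from the table above so that it runs without loading the model, and the names rows, tagged and label_counts are illustrative only.

```python
from collections import Counter

# (token text, ent_type_) pairs copied from the table above
rows = [
    ("Dictatore", ""), ("habente", ""), ("comitia", ""),
    ("Caesare", "PERSON"), ("consules", ""), ("creantur", ""),
    ("Iulius", "PERSON"), ("Caesar", "PERSON"), ("et", ""),
    ("P.", "PERSON"), ("Servilius", "PERSON"), (".", ""),
]

# An empty string is falsy, so this keeps only the tagged tokens
tagged = [(text, label) for text, label in rows if label]

# Tally how many tokens received each tag
label_counts = Counter(label for _, label in tagged)

print(tagged)
print(label_counts)
```

With a live Doc, the same filter could be written as [(t.text, t.ent_type_) for t in doc if t.ent_type_].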

Moreover, the ner component should be able to recover the entity spans from the text using IOB (i.e. inside/outside/beginning) annotations. The IOB annotations are stored in the ent_iob_ attribute of a spaCy Token object.

In the example below, the PERSON named entity is “Iulius Caesar”. The word “et” is not part of the entity. Accordingly, “Iulius” is tagged with B-PERSON (beginning of the entity), “Caesar” is tagged with I-PERSON (inside the entity, and in this case the last word in the entity), and “et” is tagged with O (outside the entity, and so closing off the preceding entity).


print(f'Sample token: {doc[6].text}')
print(f'Sample entity type: {doc[6].ent_type_}')
print(f'Sample entity IOB: {doc[6].ent_iob_}')
print(f"Sample entity IOB-type: {doc[6].ent_iob_}-{doc[6].ent_type_}", end='\n\n')

print(f'Sample token: {sample_token.text}')
print(f'Sample entity type: {sample_token.ent_type_}')
print(f'Sample entity IOB: {sample_token.ent_iob_}')
print(f"Sample entity IOB-type: {sample_token.ent_iob_}-{sample_token.ent_type_}", end='\n\n')

print(f'Sample token: {doc[8].text}')
print(f'Sample entity type: {doc[8].ent_type_}')
print(f'Sample entity IOB: {doc[8].ent_iob_}')
Sample token: Iulius
Sample entity type: PERSON
Sample entity IOB: B
Sample entity IOB-type: B-PERSON

Sample token: Caesar
Sample entity type: PERSON
Sample entity IOB: I
Sample entity IOB-type: I-PERSON

Sample token: et
Sample entity type: 
Sample entity IOB: O
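
To make the IOB logic concrete, here is a minimal sketch in plain Python of how entity spans can be rebuilt from per-token tags. It is an illustration, not how spaCy implements it; spaCy exposes the finished spans directly as doc.ents. The helper group_iob is hypothetical, and only the tags for “Iulius”, “Caesar” and “et” come from the output above: the remaining IOB values (including treating “P. Servilius” as a single entity) are assumptions made for the example.

```python
def group_iob(tagged_tokens):
    """Collect (text, iob, label) triples into (entity_text, label) spans."""
    spans = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        # A token continues the open entity only if it is tagged I
        # and carries the same label as the entity being built
        continues = iob == "I" and bool(current_words) and label == current_label
        if not continues:
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if iob in ("B", "I"):
            current_words.append(text)
            current_label = label
    # Close an entity that runs to the end of the text
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


# Tags for Iulius / Caesar / et are from the output above;
# all other IOB values are assumed for illustration
tagged_tokens = [
    ("Dictatore", "O", ""), ("habente", "O", ""), ("comitia", "O", ""),
    ("Caesare", "B", "PERSON"), ("consules", "O", ""), ("creantur", "O", ""),
    ("Iulius", "B", "PERSON"), ("Caesar", "I", "PERSON"), ("et", "O", ""),
    ("P.", "B", "PERSON"), ("Servilius", "I", "PERSON"), (".", "O", ""),
]

entity_spans = group_iob(tagged_tokens)
for span in entity_spans:
    print(span)
```

The B tag matters when two entities are adjacent: without it, consecutive PERSON tokens could not be split into separate names.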

LatinCy does not (yet) offer a spaCy EntityLinker, that is, a model-based component for resolving named entities to specific referents in a knowledge base. The Wikidata identifier for Julius Caesar is [https://www.wikidata.org/wiki/Q1048], and an EntityLinker (if it existed!) could annotate the entity span with this identifier. For now, it will suffice to note that spaCy Token objects have an attribute set aside for this purpose: ent_kb_id_.

sample_token = doc[7]
print(f'Sample token: {sample_token.text}')
print(f'Sample entity type: {sample_token.ent_type_}')
# Set the knowledge-base ID by hand, since no LatinCy component fills it in
sample_token.ent_kb_id_ = "https://www.wikidata.org/wiki/Q1048"
print(f'Sample entity KB ID: {sample_token.ent_kb_id_}')
Sample token: Caesar
Sample entity type: PERSON
Sample entity KB ID: https://www.wikidata.org/wiki/Q1048
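
As a rough illustration of what resolving entities to a knowledge base involves, the sketch below uses a hand-built dictionary in place of a trained linker. Everything in it is hypothetical: TOY_KB and link_entities are made-up names, the entity spans are typed in by hand (treating “P. Servilius” as one span is an assumption), and exact string matching is a stand-in for the context-sensitive disambiguation that a real EntityLinker would perform. Only the Wikidata URL comes from the text above.

```python
# Hypothetical lookup table; only the Q1048 URL comes from the text above
TOY_KB = {
    "Iulius Caesar": "https://www.wikidata.org/wiki/Q1048",
}


def link_entities(spans, kb):
    """Return (entity_text, label, kb_id) triples; kb_id is "" when unknown."""
    return [(text, label, kb.get(text, "")) for text, label in spans]


# Entity spans typed in by hand for the sample sentence (assumed grouping)
spans = [
    ("Caesare", "PERSON"),
    ("Iulius Caesar", "PERSON"),
    ("P. Servilius", "PERSON"),
]

linked = link_entities(spans, TOY_KB)
for row in linked:
    print(row)
```

Note that the ablative “Caesare” finds no match even though it refers to the same person. In an inflected language, exact string lookup falls short quickly, which is one reason a model-based linker would be preferable.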

spaCy has a visualizer available for named entities; this is covered in the displaCy notebook.

References

spaCy EntityRecognizer