# Imports & setup

```python
import spacy
import numpy as np
from tabulate import tabulate
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
```
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur.
Once the spaCy pipeline has been run on a text, every token is assigned a vector representation: in the sm/md/lg models, this is a 300-dimensional vector assigned by the tok2vec component; in the trf model, it is a 768-dimensional vector assigned by a custom trf_vectors component and informed by the transformer component. These vectors can be accessed through the token's vector attribute. The trf behavior is slightly different from that of the other models, so it is covered in the next notebook. Also, note that to save space within the notebook, only the first 10 dimensions of each vector are displayed in the examples below.
Here is the vector for the first token ‘Haec’ in the text given above…
```python
tokens = [token for token in doc]

for token in tokens:
    print(token.text)
    print(token.vector[:10], "etc.")
    break
```
Haec
[-1.9039791 -1.2126855 1.3490605 3.1879 -0.9067255 -0.7610278
-0.739699 -4.3730483 1.1556364 0.19760956] etc.
And here is a list of the first 10 tokens in the text, each with the first 10 dimensions of their corresponding vector…
```python
vector_example_data = []

for token in tokens[:10]:
    vector_example_data.append((token.text, *token.vector[:10]))

print(tabulate(vector_example_data))
```
We can get the vector length from the shape attribute…
```python
print(doc[0].vector.shape)
```
(300,)
LatinCy will also give you vector representations for spans of text; spaCy computes a span's vector as the mean of its token vectors, as in the sample sentence below.
```python
sents = [sent for sent in doc.sents]

for sent in sents:
    print(sent)
    print(sent.vector[:10])
    break
```
Haec narrantur a poetis de Perseo.
[-2.0929456 -1.0989287 0.29314348 1.3037221 0.78510916 -0.38522485
-0.01578556 -0.56820565 0.611749 1.2512217 ]
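Since a span's vector is the mean of its token vectors, the relationship is easy to verify. The sketch below uses small made-up numpy arrays in place of real 300-dimensional model output:

```python
import numpy as np

# Toy 4-dimensional "token vectors" standing in for real 300-d embeddings
token_vectors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])

# A span vector is the element-wise mean over its token vectors
span_vector = token_vectors.mean(axis=0)
print(span_vector)  # [5. 6. 7. 8.]
```

With a real pipeline loaded, the same check can be made against model output, e.g. `np.allclose(sent.vector, np.mean([t.vector for t in sent], axis=0))` should hold for the sentence above.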
For the sm/md/lg pipelines, we can compute the similarity between vectors using the similarity method of the spaCy objects; this returns the cosine similarity of the two vectors. Here is the similarity between two sentences…
```python
text_1 = "Omnia vincit amor."
text_2 = "Omnia vincit amicitia."

doc_1 = nlp(text_1)
doc_2 = nlp(text_2)

print("\nSimilarity between the two sentences...")
similarity = doc_1.similarity(doc_2)
print(similarity)
```
Similarity between the two sentences...
0.8994302985271659
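Under the hood, similarity is the cosine of the angle between the two vectors. Here is a minimal numpy sketch of that computation, using toy vectors in place of real document vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product scaled by the vector magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_1 = np.array([1.0, 2.0, 3.0])
vec_2 = np.array([2.0, 4.0, 6.0])   # same direction as vec_1 -> similarity 1.0
vec_3 = np.array([-3.0, 0.0, 1.0])  # orthogonal to vec_1 -> similarity 0.0

print(cosine_similarity(vec_1, vec_2))
print(cosine_similarity(vec_1, vec_3))
```

The same function applied to `doc_1.vector` and `doc_2.vector` should reproduce the value returned by `doc_1.similarity(doc_2)` above.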
# References
Jurafsky, D., and Martin, J. H. *Speech and Language Processing* (SLP), Chapter 6, “Vector Semantics and Embeddings”