# Imports & setup
import spacy
import numpy as np
from tabulate import tabulate
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur.
Vectors are used in two different ways in LatinCy. First, a static embedding layer is used to initialize what becomes a contextual embedding layer during training. Second, this contextual embedding layer—spaCy’s tok2vec component—produces the final token vectors that are used as input features for the other pipeline components. LatinCy models use floret vectors for the static embedding layer in both the md and lg model sizes. The sm model size does not use static vectors at all, relying instead on hash embeddings of word-level features (normalized form, prefix, suffix, shape) to produce the contextual tok2vec output.
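Floret combines fastText-style character n-grams with a compact hashed table: a word’s vector is assembled from the rows its hashed n-grams land in, so even unseen forms receive a vector. Here is a simplified, self-contained sketch of that idea — the bucket count, dimensionality, n-gram range, and hashing below are illustrative stand-ins, not LatinCy’s actual settings or learned weights:

```python
import hashlib

NUM_BUCKETS = 1000   # illustrative; real floret tables are much larger
DIM = 4              # illustrative; LatinCy's static vectors have 300 dims

def ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word, with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def bucket_vector(bucket):
    """Deterministic pseudo-random vector for a hash bucket (stand-in for a learned row)."""
    h = hashlib.sha256(str(bucket).encode()).digest()
    return [b / 255.0 - 0.5 for b in h[:DIM]]

def floret_style_vector(word):
    """Sum the bucket vectors of the word's hashed n-grams."""
    vec = [0.0] * DIM
    for gram in ngrams(word):
        b = int(hashlib.sha256(gram.encode()).hexdigest(), 16) % NUM_BUCKETS
        vec = [v + r for v, r in zip(vec, bucket_vector(b))]
    return vec

# The same form always maps to the same vector...
assert floret_style_vector("erat") == floret_style_vector("erat")
# ...and a form never seen in training still gets a subword-composed vector.
print(floret_style_vector("Perseo"))
```

Because vectors are composed from hashed subwords rather than looked up per word, the table stays small while out-of-vocabulary Latin forms (a real concern for a highly inflected language) still receive usable vectors.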
Again, the distinction:

- Static floret vectors (300 dimensions) — trained on the LatinCy texts corpus, these assign a fixed vector to each word form. Accessed via token.vector. The same word always receives the same vector, regardless of context.
- Contextual tok2vec vectors (96 dimensions) — produced by a CNN that takes the static vectors as input and incorporates surrounding context. Accessed via doc.tensor[token.i]. The same word receives different vectors in different sentences.
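The distinction can be sketched in miniature: a static table maps each form to one fixed vector, while a contextual encoder mixes in neighboring tokens, so the same form comes out differently in different sentences. The tiny table and window-averaging scheme below are a toy illustration of the principle, not LatinCy’s actual architecture:

```python
# Toy static table: one fixed vector per word form
static = {
    "erat":   [1.0, 0.0],
    "Caesar": [0.0, 1.0],
    "Tempus": [0.5, 0.5],
}

def static_vector(word):
    """Static lookup: context plays no role."""
    return static[word]

def contextual_vector(words, i, window=1):
    """Toy 'tok2vec': average the static vectors in a window around position i."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    neighborhood = [static_vector(w) for w in words[lo:hi]]
    return [sum(dim) / len(neighborhood) for dim in zip(*neighborhood)]

# Static: 'erat' is identical wherever it appears
assert static_vector("erat") == static_vector("erat")

# Contextual: 'erat' differs because its neighbors differ
print(contextual_vector(["Caesar", "erat"], 1))  # [0.5, 0.5]
print(contextual_vector(["Tempus", "erat"], 1))  # [0.75, 0.25]
```

The real tok2vec layer is a trained CNN rather than a plain average, but the effect is the same: the output for a token is a function of its neighborhood, not of the word form alone.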
New for v3.9: Pretraining initializes the model with information from a large amount of raw text, so that the model can learn sentence structure, word co-occurrence, and other general language patterns.
Here is the vector for the first token, ‘Haec’, in the text given above, with output truncated to the first 10 dimensions:
tokens = [token for token in doc]
for token in tokens:
    print(token.text)
    print(token.vector[:10], "etc.")
    break
Haec
[-1.9039791 -1.2126855 1.3490605 3.1879 -0.9067255 -0.7610278
-0.739699 -4.3730483 1.1556364 0.19760956] etc.
And here is a list of the first 10 tokens, each with the first 10 dimensions of their static floret vector…
vector_example_data = []
for token in tokens[:10]:
    vector_example_data.append((token.text, *token.vector[:10]))
print(tabulate(vector_example_data))
We can get the vector length from the shape attribute…
print(doc[0].vector.shape)
(300,)
10.1.1 Static vectors are context-independent
Because token.vector returns static floret vectors, the same word form always produces the same vector — regardless of the sentence it appears in. Here we compare the vector for erat in two different sentences:
# Static floret vectors: same word → same vector
doc_a = nlp("Caesar magnus erat.")
doc_b = nlp("Tempus longum erat.")
erat_a = [t for t in doc_a if t.text == "erat"][0]
erat_b = [t for t in doc_b if t.text == "erat"][0]
print(f"'erat' in '{doc_a.text}':")
print(erat_a.vector[:10])
print(f"\n'erat' in '{doc_b.text}':")
print(erat_b.vector[:10])
print(f"\nVectors identical? {np.array_equal(erat_a.vector, erat_b.vector)}")
10.1.2 Contextual tok2vec vectors are context-dependent
The tok2vec component, on the other hand, takes the static floret vectors as input and produces contextual representations that incorporate information from surrounding tokens. These are stored in doc.tensor, a matrix with one row per token, and have 96 dimensions in the lg model.
Unlike the static vectors, the contextual tok2vec output for the same word form differs depending on context, though the two vectors for erat remain highly similar across the two sentences (cosine similarity of ~0.97).
# Contextual tok2vec output: same word → different vector in different context
tensor_a = doc_a.tensor[erat_a.i]
tensor_b = doc_b.tensor[erat_b.i]
print(f"tok2vec output for 'erat' in '{doc_a.text}':")
print(tensor_a[:10])
print(f"\ntok2vec output for 'erat' in '{doc_b.text}':")
print(tensor_b[:10])
print(f"\nVectors identical? {np.array_equal(tensor_a, tensor_b)}")
cos_sim = np.dot(tensor_a, tensor_b) / (np.linalg.norm(tensor_a) * np.linalg.norm(tensor_b))
print(f"Cosine similarity: {cos_sim:.4f}")
tok2vec output for 'erat' in 'Caesar magnus erat.':
[-5.5985093 1.6060945 5.2459955 0.73094916 -3.8397129 -4.930661
2.839848 -2.5907235 -2.5042472 -3.0111873 ]
tok2vec output for 'erat' in 'Tempus longum erat.':
[-5.395909 1.074707 3.5050492 1.1230471 -4.0830016 -5.5699005
2.8832095 -2.082502 -2.075153 -4.2625637]
Vectors identical? False
Cosine similarity: 0.9701
10.1.3 Span and document vectors
The span.vector and doc.vector properties return the mean of the static token.vector values — not the contextual tok2vec output. These averaged static vectors provide a simple, deterministic representation for a span of text:
sents = [sent for sent in doc.sents]
for sent in sents:
    print(sent)
    print(sent.vector[:10])
    break
Haec narrantur a poetis de Perseo.
[-2.0929456 -1.0989287 0.29314348 1.3037221 0.78510916 -0.38522485
-0.01578556 -0.56820565 0.611749 1.2512217 ]
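That averaging is easy to verify by hand: a span’s vector is the element-wise mean of its tokens’ static vectors. A minimal sketch, using short plain-Python lists in place of the real 300-dimensional floret rows:

```python
def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (what span.vector computes)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# Toy 3-dimensional 'token vectors' for a three-token span
token_vecs = [
    [1.0, 2.0, 3.0],
    [3.0, 0.0, 3.0],
    [2.0, 4.0, 0.0],
]
print(mean_pool(token_vecs))  # [2.0, 2.0, 2.0]
```

Because the pooling is a plain mean over static rows, the result is deterministic and order-insensitive: shuffling the tokens of a span leaves span.vector unchanged.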
The doc.similarity() method computes cosine similarity between averaged static vectors. Because these are based on static floret embeddings, the similarity score reflects overlap in word forms and their trained associations, not context-dependent meaning:
text_1 = "Omnia vincit amor."
text_2 = "Omnia vincit amicitia."
doc_1 = nlp(text_1)
doc_2 = nlp(text_2)
print("\nSimilarity between the two sentences...")
similarity = doc_1.similarity(doc_2)
print(similarity)
Similarity between the two sentences...
0.8994302985271659
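Under the hood, doc.similarity() is just the cosine of the angle between the two averaged vectors. The same quantity can be recomputed by hand — a pure-Python sketch with small toy vectors standing in for the real 300-dimensional averages:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0 (up to float rounding); orthogonal vectors score 0.0
print(cosine([1.0, 2.0], [1.0, 2.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the inputs are averaged static vectors, a high score like the ~0.90 above mainly reflects shared word forms (here, Omnia and vincit) and the trained associations of the remaining forms, not any contextual reading of the two sentences.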
References
SLP Ch. 5, “Embeddings” link
SLP Ch. 10, “Masked Language Models” link