# Imports & setup
import spacy
import numpy as np
from tabulate import tabulate
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur.
Vectors are used in two different ways in LatinCy. First, a static embedding layer is used to initialize what becomes a contextual embedding layer during training. Second, this contextual embedding layer—spaCy’s tok2vec component—produces the final token vectors that are used as input features for the other pipeline components. LatinCy models use floret vectors for the static embedding layer in both the md and lg model sizes. The sm model size does not use static vectors at all, relying instead on hash embeddings of word-level features (normalized form, prefix, suffix, shape) to produce the contextual tok2vec output.
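Floret combines fastText-style character n-grams with a compact hashed table: a word’s vector is assembled from the rows its hashed n-grams land in, so even unseen forms receive a vector. Here is a simplified, self-contained sketch of that idea — the bucket count, dimensionality, n-gram range, and hashing below are illustrative stand-ins, not LatinCy’s actual settings or learned weights:

```python
import hashlib

NUM_BUCKETS = 1000   # illustrative; real floret tables are much larger
DIM = 4              # illustrative; LatinCy's static vectors have 300 dims

def ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word, with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def bucket_vector(bucket):
    """Deterministic pseudo-random vector for a hash bucket (stand-in for a learned row)."""
    h = hashlib.sha256(str(bucket).encode()).digest()
    return [b / 255.0 - 0.5 for b in h[:DIM]]

def floret_style_vector(word):
    """Sum the bucket vectors of the word's hashed n-grams."""
    vec = [0.0] * DIM
    for gram in ngrams(word):
        b = int(hashlib.sha256(gram.encode()).hexdigest(), 16) % NUM_BUCKETS
        vec = [v + r for v, r in zip(vec, bucket_vector(b))]
    return vec

# The same form always maps to the same vector...
assert floret_style_vector("erat") == floret_style_vector("erat")
# ...and a form never seen in training still gets a subword-composed vector.
print(floret_style_vector("Perseo"))
```

Because vectors are composed from hashed subwords rather than looked up per word, the table stays small while out-of-vocabulary Latin forms (a real concern for a highly inflected language) still receive usable vectors.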
Again, the distinction:

- Static floret vectors (300 dimensions) — trained on the LatinCy texts corpus, these assign a fixed vector to each word form. Accessed via token.vector. The same word always receives the same vector, regardless of context.
- Contextual tok2vec vectors (96 dimensions) — produced by a CNN that takes the static vectors as input and incorporates surrounding context. Accessed via doc.tensor[token.i]. The same word receives different vectors in different sentences.
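The distinction can be sketched in miniature: a static table maps each form to one fixed vector, while a contextual encoder mixes in neighboring tokens, so the same form comes out differently in different sentences. The tiny table and window-averaging scheme below are a toy illustration of the principle, not LatinCy’s actual architecture:

```python
# Toy static table: one fixed vector per word form
static = {
    "erat":   [1.0, 0.0],
    "Caesar": [0.0, 1.0],
    "Tempus": [0.5, 0.5],
}

def static_vector(word):
    """Static lookup: context plays no role."""
    return static[word]

def contextual_vector(words, i, window=1):
    """Toy 'tok2vec': average the static vectors in a window around position i."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    neighborhood = [static_vector(w) for w in words[lo:hi]]
    return [sum(dim) / len(neighborhood) for dim in zip(*neighborhood)]

# Static: 'erat' is identical wherever it appears
assert static_vector("erat") == static_vector("erat")

# Contextual: 'erat' differs because its neighbors differ
print(contextual_vector(["Caesar", "erat"], 1))  # [0.5, 0.5]
print(contextual_vector(["Tempus", "erat"], 1))  # [0.75, 0.25]
```

The real tok2vec layer is a trained CNN rather than a plain average, but the effect is the same: the output for a token is a function of its neighborhood, not of the word form alone.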
New for v3.9: Pretraining initializes the model with information from a large amount of raw text, so that the model can learn sentence structure, word co-occurrence, and other general language patterns.
Here is the vector for the first token, ‘Haec’, in the text given above, with output truncated to the first 10 dimensions:
tokens = [token for token in doc]
for token in tokens:
    print(token.text)
    print(token.vector[:10], "etc.")
    break
Haec
[-1.9039791 -1.2126855 1.3490605 3.1879 -0.9067255 -0.7610278
-0.739699 -4.3730483 1.1556364 0.19760956] etc.
And here is a list of the first 10 tokens, each with the first 10 dimensions of their static floret vector…
vector_example_data = []
for token in tokens[:10]:
    vector_example_data.append((token.text, *token.vector[:10]))
print(tabulate(vector_example_data))
We can get the vector length from the shape attribute…
print(doc[0].vector.shape)
(300,)
10.1.1 Static vectors are context-independent
Because token.vector returns static floret vectors, the same word form always produces the same vector — regardless of the sentence it appears in. Here we compare the vector for erat in two different sentences:
# Static floret vectors: same word → same vector
doc_a = nlp("Caesar magnus erat.")
doc_b = nlp("Tempus longum erat.")
erat_a = [t for t in doc_a if t.text == "erat"][0]
erat_b = [t for t in doc_b if t.text == "erat"][0]
print(f"'erat' in '{doc_a.text}':")
print(erat_a.vector[:10])
print(f"\n'erat' in '{doc_b.text}':")
print(erat_b.vector[:10])
print(f"\nVectors identical? {np.array_equal(erat_a.vector, erat_b.vector)}")
10.1.2 Contextual tok2vec vectors are context-dependent
The tok2vec component, on the other hand, takes the static floret vectors as input and produces contextual representations that incorporate information from surrounding tokens. These are stored in doc.tensor, a matrix with one row per token, and have 96 dimensions in the lg model.
Unlike the static vectors, the contextual tok2vec output for the same word form differs depending on context, though the two vectors for erat remain highly similar across the two sentences (cosine similarity of ~0.97).
# Contextual tok2vec output: same word → different vector in different context
tensor_a = doc_a.tensor[erat_a.i]
tensor_b = doc_b.tensor[erat_b.i]
print(f"tok2vec output for 'erat' in '{doc_a.text}':")
print(tensor_a[:10])
print(f"\ntok2vec output for 'erat' in '{doc_b.text}':")
print(tensor_b[:10])
print(f"\nVectors identical? {np.array_equal(tensor_a, tensor_b)}")
cos_sim = np.dot(tensor_a, tensor_b) / (np.linalg.norm(tensor_a) * np.linalg.norm(tensor_b))
print(f"Cosine similarity: {cos_sim:.4f}")
tok2vec output for 'erat' in 'Caesar magnus erat.':
[-5.5985093 1.6060945 5.2459955 0.73094916 -3.8397129 -4.930661
2.839848 -2.5907235 -2.5042472 -3.0111873 ]
tok2vec output for 'erat' in 'Tempus longum erat.':
[-5.395909 1.074707 3.5050492 1.1230471 -4.0830016 -5.5699005
2.8832095 -2.082502 -2.075153 -4.2625637]
Vectors identical? False
Cosine similarity: 0.9701
10.1.3 Span and document vectors
The span.vector and doc.vector properties return the mean of the static token.vector values — not the contextual tok2vec output. These averaged static vectors provide a simple, deterministic representation for a span of text:
sents = [sent for sent in doc.sents]
for sent in sents:
    print(sent)
    print(sent.vector[:10])
    break
Haec narrantur a poetis de Perseo.
[-2.0929456 -1.0989287 0.29314348 1.3037221 0.78510916 -0.38522485
-0.01578556 -0.56820565 0.611749 1.2512217 ]
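That averaging is easy to verify by hand: a span’s vector is the element-wise mean of its tokens’ static vectors. A minimal sketch, using short plain-Python lists in place of the real 300-dimensional floret rows:

```python
def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (what span.vector computes)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# Toy 3-dimensional 'token vectors' for a three-token span
token_vecs = [
    [1.0, 2.0, 3.0],
    [3.0, 0.0, 3.0],
    [2.0, 4.0, 0.0],
]
print(mean_pool(token_vecs))  # [2.0, 2.0, 2.0]
```

Because the pooling is a plain mean over static rows, the result is deterministic and order-insensitive: shuffling the tokens of a span leaves span.vector unchanged.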
The doc.similarity() method computes cosine similarity between averaged static vectors. Because these are based on static floret embeddings, the similarity score reflects overlap in word forms and their trained associations, not context-dependent meaning:
text_1 = "Omnia vincit amor."
text_2 = "Omnia vincit amicitia."
doc_1 = nlp(text_1)
doc_2 = nlp(text_2)
print("\nSimilarity between the two sentences...")
similarity = doc_1.similarity(doc_2)
print(similarity)
Similarity between the two sentences...
0.8994302985271659
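Under the hood, doc.similarity() is just the cosine of the angle between the two averaged vectors. The same quantity can be recomputed by hand — a pure-Python sketch with small toy vectors standing in for the real 300-dimensional averages:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0 (up to float rounding); orthogonal vectors score 0.0
print(cosine([1.0, 2.0], [1.0, 2.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the inputs are averaged static vectors, a high score like the ~0.90 above mainly reflects shared word forms (here, Omnia and vincit) and the trained associations of the remaining forms, not any contextual reading of the two sentences.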
References
SLP Ch. 5, “Embeddings” link
SLP Ch. 10, “Masked Language Models” link