# Imports & setup
import spacy
import numpy as np
from tabulate import tabulate
from pprint import pprint

nlp = spacy.load('la_core_web_trf')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat."

doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat.
As noted in the previous notebook, the trf vectors work slightly differently from the vectors in the other pipelines. This notebook covers the specifics of working with the trf contextual vectors.
Here is the vector for the first token, ‘Haec’, in the text given above (only the first ten dimensions are printed)…
tokens = [token for token in doc]

for token in tokens:
    print(token.text)
    print(token.vector[:10], "etc.")
    break
Haec
[-0.2595865 -1.5687511 0.5772662 0.8258832 -0.9833767 -2.6926274
-1.8952293 2.7565694 3.1001453 1.6909373] etc.
The LatinCy trf model uses MultilingualBERT as the basis of its ‘transformer’ component. The trf model has a custom Doc attribute, trf_token_vecs, that gives access to the per-token contextual vectors; these are in turn mapped back to the vector attribute of the Token object via a user hook. Accordingly, you can access the vector just as you do with the other pipelines, but the output will be contextually informed.
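You do not need to set this mapping up yourself, since the pipeline handles it; but as a rough illustration of the mechanism, here is a minimal sketch of a spaCy user hook (not LatinCy's actual implementation; the component name is hypothetical)…

from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical stand-in for the extension that LatinCy registers itself
if not Doc.has_extension("trf_token_vecs"):
    Doc.set_extension("trf_token_vecs", default=None)

@Language.component("contextual_vector_hook")  # hypothetical component name
def contextual_vector_hook(doc):
    # "vector" is one of the token-level hooks spaCy supports; this redirects
    # Token.vector to the matching row of the stored contextual-vector matrix
    doc.user_token_hooks["vector"] = lambda token: doc._.trf_token_vecs[token.i]
    return doc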
# Here is an example using Acrisius, a word which appears twice in the given text...
print("Token.text values for two tokens...")
print(doc[17])
print(doc[20])

# Note that we have a Doc custom attribute `trf_token_vecs` which is a list of the
# vectors for each token in the document. We can access the vector for a specific
# token by using the index of the token in the document.
print("\nSlices corresponding to Acrisius from the trf_token_vecs Doc custom attribute...")
print(doc._.trf_token_vecs[17][:5])
print(doc._.trf_token_vecs[20][:5])

# But we can access these contextual vectors directly via the `vector` Token attribute.
print("\nToken attribute for the Acrisius slices...")
print(doc[17].vector[:5])
print(doc[20].vector[:5])

print("\nVectors are the same?")
print(np.mean(doc[17].vector) == np.mean(doc._.trf_token_vecs[17]))
Token.text values for two tokens...
Acrisius
Acrisius
Slices corresponding to Acrisius from the trf_token_vecs Doc custom attribute...
[-2.048665 -4.7464104 0.6075096 0.18537973 -4.5588617 ]
[-3.147872 -5.0107517 2.0665998 -0.12608378 -2.9859593 ]
Token attribute for the Acrisius slices...
[-2.048665 -4.7464104 0.6075096 0.18537973 -4.5588617 ]
[-3.147872 -5.0107517 2.0665998 -0.12608378 -2.9859593 ]
Vectors are the same?
True
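Although the two occurrences of Acrisius share a surface form, their contextual vectors are not identical. A quick check, comparing the two tokens with cosine similarity (the exact value will vary), might look like this…

# Compare the two occurrences of Acrisius: same word form, different contexts
vec_a = doc[17].vector
vec_b = doc[20].vector
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cos)  # high, but below 1.0, since the vectors are context-dependent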
Reminder that the (MultilingualBERT-derived) vectors have a dimensionality of 768…
print(doc[0].vector.shape)
(768,)
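You can also check the shape of the full per-document matrix; assuming trf_token_vecs converts cleanly to a two-dimensional NumPy array, it has one 768-dimension row per token…

vecs = np.asarray(doc._.trf_token_vecs)
print(vecs.shape)  # (number of tokens in doc, 768)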
For the trf pipelines, I have not (yet!) implemented a similarity method. Vectors can of course still be compared using other methods, such as cosine similarity. Here is the cosine similarity between the vectors for two sentences, computed with a small custom function…
text_1 = "Omnia vincit amor."
text_2 = "Omnia vincit amicitia."

doc_1 = nlp(text_1)
doc_2 = nlp(text_2)

def cosine_similarity(vec1, vec2):
    # Ensure vectors are flattened numpy arrays
    vec1_flat = vec1.reshape(-1)
    vec2_flat = vec2.reshape(-1)

    # Compute cosine similarity
    dot_product = np.dot(vec1_flat, vec2_flat)
    norm1 = np.linalg.norm(vec1_flat)
    norm2 = np.linalg.norm(vec2_flat)
    return dot_product / (norm1 * norm2)

# Get the vectors and compute similarity
vec1 = doc_1._.trf_token_vecs
vec2 = doc_2._.trf_token_vecs

print("\nSimilarity between the two sentences...")
similarity = cosine_similarity(vec1, vec2)
print(similarity)
Similarity between the two sentences...
0.9051101
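Note that flattening the two token matrices only works here because both sentences happen to tokenize to the same number of tokens. A common alternative that handles texts of different lengths is to mean-pool each document's token vectors into a single 768-dimensional vector first; here is a sketch reusing the cosine_similarity function above (the resulting value will differ from the flattened version)…

# Mean-pool each document's token vectors into a single 768-dimensional vector
vec1_mean = np.asarray(doc_1._.trf_token_vecs).mean(axis=0)
vec2_mean = np.asarray(doc_2._.trf_token_vecs).mean(axis=0)

print(cosine_similarity(vec1_mean, vec2_mean))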