# Imports & setup
import spacy
import numpy as np
from tabulate import tabulate
from pprint import pprint

nlp = spacy.load('la_core_web_trf')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat."

doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum; avus eius Acrisius appellabatur. Acrisius volebat Perseum nepotem suum necare; nam propter oraculum puerum timebat.
As noted in the previous notebook, the trf vectors work slightly differently from the vectors in the other pipelines. This notebook covers the specifics of working with the trf contextual vectors.
Here is the vector for the first token, ‘Haec’, in the text given above (only the first ten dimensions are printed)…
tokens = [token for token in doc]

for token in tokens:
    print(token.text)
    print(token.vector[:10], "etc.")
    break
Haec
[-0.2595865 -1.5687511 0.5772662 0.8258832 -0.9833767 -2.6926274
-1.8952293 2.7565694 3.1001453 1.6909373] etc.
The LatinCy trf model uses MultilingualBERT as the basis of its ‘transformer’ component. The trf model has a custom Doc attribute, trf_token_vecs, that gives access to the per-token contextual vectors; these are in turn mapped back to the vector attribute of the Token object via a user hook. Accordingly, you can access the vector just as you do with the other pipelines, but the output will be contextually informed.
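You do not need to set this mapping up yourself, since the pipeline handles it; but as a rough illustration of the mechanism, here is a minimal sketch of a spaCy user hook (not LatinCy's actual implementation; the component name is hypothetical)…

from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical stand-in for the extension that LatinCy registers itself
if not Doc.has_extension("trf_token_vecs"):
    Doc.set_extension("trf_token_vecs", default=None)

@Language.component("contextual_vector_hook")  # hypothetical component name
def contextual_vector_hook(doc):
    # "vector" is one of the token-level hooks spaCy supports; this redirects
    # Token.vector to the matching row of the stored contextual-vector matrix
    doc.user_token_hooks["vector"] = lambda token: doc._.trf_token_vecs[token.i]
    return doc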
# Here is an example using Acrisius, a word which appears twice in the given text...
print("Token.text values for two tokens...")
print(doc[17])
print(doc[20])

# Note that we have a Doc custom attribute `trf_token_vecs` which is a list of the
# vectors for each token in the document. We can access the vector for a specific
# token by using the index of the token in the document.
print("\nSlices corresponding to Acrisius from the trf_token_vecs Doc custom attribute...")
print(doc._.trf_token_vecs[17][:5])
print(doc._.trf_token_vecs[20][:5])

# But we can access these contextual vectors directly via the `vector` Token attribute.
print("\nToken attribute for the Acrisius slices...")
print(doc[17].vector[:5])
print(doc[20].vector[:5])

print("\nVectors are the same?")
print(np.mean(doc[17].vector) == np.mean(doc._.trf_token_vecs[17]))
Token.text values for two tokens...
Acrisius
Acrisius
Slices corresponding to Acrisius from the trf_token_vecs Doc custom attribute...
[-2.048665 -4.7464104 0.6075096 0.18537973 -4.5588617 ]
[-3.147872 -5.0107517 2.0665998 -0.12608378 -2.9859593 ]
Token attribute for the Acrisius slices...
[-2.048665 -4.7464104 0.6075096 0.18537973 -4.5588617 ]
[-3.147872 -5.0107517 2.0665998 -0.12608378 -2.9859593 ]
Vectors are the same?
True
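Although the two occurrences of Acrisius share a surface form, their contextual vectors are not identical. A quick check, comparing the two tokens with cosine similarity (the exact value will vary), might look like this…

# Compare the two occurrences of Acrisius: same word form, different contexts
vec_a = doc[17].vector
vec_b = doc[20].vector
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cos)  # high, but below 1.0, since the vectors are context-dependent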
Reminder that the (MultilingualBERT-derived) vectors have a dimensionality of 768…
print(doc[0].vector.shape)
(768,)
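You can also check the shape of the full per-document matrix; assuming trf_token_vecs converts cleanly to a two-dimensional NumPy array, it has one 768-dimension row per token…

vecs = np.asarray(doc._.trf_token_vecs)
print(vecs.shape)  # (number of tokens in doc, 768)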
For the trf pipelines, I have not (yet!) implemented a similarity method. Vectors can of course still be compared using other methods, such as cosine similarity. Here is the cosine similarity between the vectors for two sentences, computed with a small custom function…
text_1 = "Omnia vincit amor."
text_2 = "Omnia vincit amicitia."

doc_1 = nlp(text_1)
doc_2 = nlp(text_2)

def cosine_similarity(vec1, vec2):
    # Ensure vectors are flattened numpy arrays
    vec1_flat = vec1.reshape(-1)
    vec2_flat = vec2.reshape(-1)

    # Compute cosine similarity
    dot_product = np.dot(vec1_flat, vec2_flat)
    norm1 = np.linalg.norm(vec1_flat)
    norm2 = np.linalg.norm(vec2_flat)
    return dot_product / (norm1 * norm2)

# Get the vectors and compute similarity
vec1 = doc_1._.trf_token_vecs
vec2 = doc_2._.trf_token_vecs

print("\nSimilarity between the two sentences...")
similarity = cosine_similarity(vec1, vec2)
print(similarity)
Similarity between the two sentences...
0.9051101
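Note that flattening the two token matrices only works here because both sentences happen to tokenize to the same number of tokens. A common alternative that handles texts of different lengths is to mean-pool each document's token vectors into a single 768-dimensional vector first; here is a sketch reusing the cosine_similarity function above (the resulting value will differ from the flattened version)…

# Mean-pool each document's token vectors into a single 768-dimensional vector
vec1_mean = np.asarray(doc_1._.trf_token_vecs).mean(axis=0)
vec2_mean = np.asarray(doc_2._.trf_token_vecs).mean(axis=0)

print(cosine_similarity(vec1_mean, vec2_mean))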