Noun chunking is the task of identifying a noun and the words directly related to it. More specifically, with respect to the LatinCy pipeline, it uses the annotations of the POS tagger (tagger) and the dependency parser (parser) to identify all children of a token tagged NOUN. Unlike the NLP tasks in the preceding sections, this task cannot be associated directly with a pipeline component; rather, it is a special Span case defined in the spaCy language model itself under syntax iterators.
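To make the idea concrete, here is a minimal, spaCy-free sketch of what a syntax iterator does: given tokens annotated with POS tags and dependency heads, it yields each NOUN together with its direct dependents. The Token class and the two-word sample are illustrative assumptions, not LatinCy's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Token:
    i: int       # position in the sentence
    text: str
    pos: str     # coarse POS tag, e.g. "NOUN"
    head: int    # index of the dependency head (self-index for the root)

def noun_chunks(tokens):
    """Yield (start, end) spans: each NOUN plus its direct dependents."""
    for tok in tokens:
        if tok.pos == "NOUN":
            members = [t.i for t in tokens if t.head == tok.i] + [tok.i]
            yield min(members), max(members) + 1

# "maximi deorum" -- deorum (NOUN) heads the adjective maximi
sent = [
    Token(0, "maximi", "ADJ", head=1),
    Token(1, "deorum", "NOUN", head=1),
]
for start, end in noun_chunks(sent):
    print(" ".join(t.text for t in sent[start:end]))  # maximi deorum
```

spaCy's real iterators work the same way in outline: they walk the parsed Doc and yield span boundaries, which spaCy then wraps as Span objects on doc.noun_chunks.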
# Imports & setup
import spacy
from pprint import pprint

nlp = spacy.load('la_core_web_lg')

text = "Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur."
doc = nlp(text)
print(doc)
Haec narrantur a poetis de Perseo. Perseus filius erat Iovis, maximi deorum. Avus eius Acrisius appellabatur.
As defined in syntax_iterators, noun chunks become an attribute of the spaCy Doc object. Noun chunks return any token span whose dependency head is tagged NOUN. A one-word noun chunk is just the noun itself and so not of particular interest; what we are really looking for are longer spans of noun-related text.
# Print the noun chunks in the text
for noun_chunk in doc.noun_chunks:
    if len(noun_chunk) > 1:
        print(noun_chunk)