URL: https://www.geeksforgeeks.org/nlp/unsupervised-noun-extraction-in-nlp/

Unsupervised Noun Extraction in NLP

Last Updated : 23 Jul, 2025

Unsupervised noun extraction is a technique in Natural Language Processing (NLP) used to identify and extract nouns from text without relying on labelled training data. Instead, it leverages statistical and linguistic patterns to detect noun phrases. This approach is particularly valuable for processing large volumes of text where manual annotation is impractical.

In this article, we will explore several methods used in unsupervised noun extraction, including branching entropy, accessor variety, and cohesion score. We will also implement these techniques using Python to gain hands-on experience.

What is Unsupervised Noun Extraction?

Unsupervised noun extraction exploits the statistical properties of a language to identify nouns without relying on labelled datasets. This makes it particularly helpful for languages where labelled data is unavailable or very limited. By analyzing word co-occurrences and context distributions, unsupervised methods can flag words and phrases that behave like nouns. These techniques rely on metrics such as entropy, mutual information, and accessor variety to estimate how likely a word is to be a noun from its contextual usage. We cover each of these metrics in depth in the following sections.

One of the main benefits of unsupervised noun extraction is its scalability: large collections of text can be processed without manual annotation, whereas traditional supervised methods require labelled data that is both time-consuming and costly to produce.

Branching Entropy for Unsupervised Noun Extraction in NLP

Branching entropy is a concept used in natural language processing (NLP) for unsupervised noun extraction. It helps identify noun phrases by measuring the uncertainty or randomness in the continuation of a word sequence. Specifically, branching entropy can be used to determine the boundaries of noun phrases by analyzing the likelihood of different words following a given word or sequence of words.

Branching entropy is derived from the Shannon entropy formula and is used to measure the uncertainty of word occurrences in different contexts.

There are two types of branching entropy:

  1. Forward Branching Entropy: Measures the uncertainty in the possible continuations of a word or sequence.
  2. Backward Branching Entropy: Measures the uncertainty in the possible preceding words of a word or sequence.

The entropy value is higher when there are many possible continuations (or preceding words) with relatively equal probabilities, indicating high uncertainty. Conversely, a lower entropy value indicates fewer possible continuations, suggesting more certainty.

Calculating Branching Entropy

Forward Branching Entropy

Given a word w in a sequence, the forward branching entropy is calculated as:

$$H_f(w) = -\sum_{w'} P_f(w' \mid w) \log_2 P_f(w' \mid w)$$

where $P_f(w' \mid w)$ is the conditional probability of the word $w'$ following the word w.

Backward Branching Entropy

Similarly, the backward branching entropy is calculated as:

$$H_b(w) = -\sum_{w'} P_b(w' \mid w) \log_2 P_b(w' \mid w)$$

where $P_b(w' \mid w)$ is the conditional probability of the word $w'$ preceding the word w.
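As a minimal sketch of these two quantities, the snippet below estimates the conditional probabilities from adjacent-word counts in a token list and computes both entropies in bits. The function name `branching_entropies` and the toy sentence are illustrative assumptions, not part of the original implementation.

```python
import math
from collections import Counter, defaultdict


def branching_entropies(tokens):
    """Return (forward, backward) branching entropy per word, in bits."""
    following = defaultdict(Counter)  # word -> counts of the word right after it
    preceding = defaultdict(Counter)  # word -> counts of the word right before it
    for left, right in zip(tokens, tokens[1:]):
        following[left][right] += 1
        preceding[right][left] += 1

    def entropy(counts):
        # Shannon entropy: sum of p * log2(1/p) over the observed neighbours.
        total = sum(counts.values())
        return sum((c / total) * math.log2(total / c) for c in counts.values())

    forward = {w: entropy(c) for w, c in following.items()}
    backward = {w: entropy(c) for w, c in preceding.items()}
    return forward, backward


# Toy input (an assumption for illustration only).
tokens = "the cat sat on the mat and the dog sat on the rug".split()
forward, backward = branching_entropies(tokens)
print(forward["the"])  # 2.0: four equally likely continuations
print(forward["sat"])  # 0.0: "sat" is always followed by "on"
```

A word whose neighbours are spread evenly over many alternatives scores high, while a word with a single fixed neighbour scores zero, which is why entropy peaks are read as likely phrase boundaries.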

Unsupervised Noun Extraction Using Branching Entropy in Python

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Define Helper Function to Calculate Entropy: Implement calculate_entropy function to compute entropy based on word counts.
  3. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  4. Perform POS Tagging:
    • Use NLTK’s POS tagger to tag the filtered tokens.
    • Focus on nouns (tags starting with 'NN').
  5. Create Co-occurrence Matrix: Construct a co-occurrence matrix to count how often each word appears with its context words within a specified window size.
  6. Calculate Branching Entropy for Each Noun: Calculate the entropy for each noun based on its context word counts.
  7. Set Entropy Threshold and Extract Nouns:
    • Determine a threshold for noun extraction based on the calculated entropies.
    • Extract nouns that meet or exceed the threshold.
  8. Test the Function: Apply the function to a sample text and print the extracted nouns and their branching entropies.

Output:

Extracted Nouns: ['brown', 'fox', 'jumps', 'dog']
Nouns with Branching Entropy: {'brown': 2.584962500721156, 'fox': 2.584962500721156, 'jumps': 2.584962500721156, 'dog': 2.584962500721156}

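This output highlights the importance of context size and POS tagging accuracy in noun extraction using branching entropy. Every extracted word receives the same entropy, about 2.585 bits (log2 6), which is the value a uniform distribution over six context words would give, so on a single short sentence the score cannot separate one word from another. The list also contains 'brown' and 'jumps', which are not used as nouns in the sentence, suggesting tagging errors on the stopword-filtered tokens. With a larger corpus and more accurate tagging, the results would likely be more accurate and varied.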
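The steps listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the original NLTK-based code: it uses a regex tokenizer, a tiny hand-written stopword set (`STOPWORDS`), skips POS tagging entirely, and defaults the threshold to the mean entropy. The names `extract_by_entropy` and `context_entropy` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword set; NLTK's list is much longer.
STOPWORDS = {"the", "a", "an", "over", "on", "in", "of", "and", "is"}


def context_entropy(counts):
    """Shannon entropy (bits) of a Counter of context-word counts."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def extract_by_entropy(text, window=2, threshold=None):
    # Step 3: tokenize, lowercase, drop stopwords (the regex also drops punctuation).
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

    # Step 4 (POS tagging) is skipped here, so every remaining word is a candidate.

    # Step 5: count context words within `window` positions on either side.
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word][tokens[j]] += 1

    # Step 6: entropy of each word's context distribution.
    entropies = {w: context_entropy(c) for w, c in contexts.items()}
    if not entropies:
        return [], {}

    # Step 7: default threshold is the mean entropy (an arbitrary choice).
    if threshold is None:
        threshold = sum(entropies.values()) / len(entropies)
    selected = [w for w, h in entropies.items() if h >= threshold]
    return selected, entropies


selected, entropies = extract_by_entropy("The quick brown fox jumps over the lazy dog.")
print(selected)
print(entropies)
```

Because no POS filter is applied and the window is smaller, the numbers differ from the output shown above. On one sentence the score mostly reflects how many neighbours fit inside the window, so words in the middle of the sentence rank highest regardless of their part of speech.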

Accessor Variety for Unsupervised Noun Extraction in NLP

Accessor Variety (AV) is another concept used in NLP for unsupervised noun extraction. It measures the diversity of contexts (both preceding and following words) in which a candidate word appears. The idea is that nouns often appear in a variety of contexts, while other types of words (like function words) do not.

Concept of Accessor Variety

  1. Forward Accessor Variety (FAV): The number of unique words that follow a given word.
  2. Backward Accessor Variety (BAV): The number of unique words that precede a given word.

The total accessor variety (TAV) is the sum of FAV and BAV:

$$\mathrm{TAV}(w) = \mathrm{FAV}(w) + \mathrm{BAV}(w)$$

By calculating the AV values for words, we can identify candidate nouns based on their contextual diversity. Words with high AV values are more likely to be nouns.

Unsupervised Noun Extraction Using Accessor Variety in Python

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  3. Perform POS Tagging:
    • Use NLTK’s POS tagger to tag the filtered tokens.
    • Focus on nouns (tags starting with 'NN').
  4. Calculate Forward and Backward Accessor Variety: For each word, count the number of unique preceding and following words.
  5. Combine FAV and BAV to Get Total Accessor Variety (TAV): Sum the FAV and BAV for each word to get the total accessor variety.
  6. Set Threshold for Noun Extraction: Determine a threshold for noun extraction based on the AV values.

Output:

Extracted Nouns: ['dog', 'lazy', 'brown', 'jumps', 'fox']
Forward Accessor Variety: {'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1}
Backward Accessor Variety: {'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1, '.': 1}
Total Accessor Variety: {'dog': 2, 'lazy': 2, '.': 1, 'brown': 2, 'quick': 1, 'jumps': 2, 'fox': 2}

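The extracted words are those whose total accessor variety reaches 2, the highest value in this example. High accessor variety indicates a high diversity of contexts, which is typical for nouns in natural language, but on a single sentence every interior word reaches the same value, so the list also includes non-nouns such as 'lazy' and 'jumps'. A larger corpus is needed for the scores to become discriminative.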
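The accessor variety counts can be sketched with sets of neighbouring words. The token list below is an assumption: it is what the example sentence is taken to look like after lowercasing and stopword removal. The function name `accessor_variety` and the threshold of 2 are likewise illustrative.

```python
from collections import defaultdict


def accessor_variety(tokens):
    """Return (FAV, BAV, TAV) dictionaries for a list of tokens."""
    followers = defaultdict(set)     # word -> distinct words seen right after it
    predecessors = defaultdict(set)  # word -> distinct words seen right before it
    for left, right in zip(tokens, tokens[1:]):
        followers[left].add(right)
        predecessors[right].add(left)

    fav = {w: len(s) for w, s in followers.items()}
    bav = {w: len(s) for w, s in predecessors.items()}
    # dict.fromkeys keeps first-occurrence order, so the result is deterministic.
    tav = {w: fav.get(w, 0) + bav.get(w, 0) for w in dict.fromkeys(tokens)}
    return fav, bav, tav


# Assumed stopword-filtered tokens of "The quick brown fox jumps over the lazy dog."
tokens = ["quick", "brown", "fox", "jumps", "lazy", "dog", "."]
fav, bav, tav = accessor_variety(tokens)
extracted = [w for w, v in tav.items() if v >= 2]  # threshold of 2 is an assumption
print(extracted)
print(tav)
```

With these tokens the sketch yields the same FAV, BAV, and TAV values as the output listed above; only the ordering of the extracted list differs.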

Cohesion Score for Unsupervised Noun Extraction in NLP

Cohesion Score is another method used in NLP for unsupervised noun extraction. It measures the strength of association between words in a phrase or sequence, indicating how likely they are to form a meaningful unit, such as a noun phrase.

Cohesion score is calculated based on the mutual information between adjacent words in a sequence. Mutual Information (MI) measures the degree of association between two words by comparing the observed frequency of their co-occurrence with the frequency expected if the words were independent.

Mutual Information Formula

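For two words $w_1$ and $w_2$, the mutual information is calculated as:

$$MI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

Where:

  • $P(w_1, w_2)$ is the joint probability of words $w_1$ and $w_2$ occurring together.
  • $P(w_1)$ and $P(w_2)$ are the individual probabilities of words $w_1$ and $w_2$.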
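To make the formula concrete, here is a small sketch that evaluates it for hand-picked probabilities. The function name `mutual_information` and the probability values are hypothetical, chosen so the arithmetic is easy to follow.

```python
import math


def mutual_information(p_joint, p_first, p_second):
    """log2 of the observed joint probability over the value expected under independence."""
    return math.log2(p_joint / (p_first * p_second))


# Both words have probability 0.125, so independence predicts a joint
# probability of 0.125 * 0.125 = 0.015625.
print(mutual_information(0.0625, 0.125, 0.125))    # 2.0: pair is 4x more frequent than chance
print(mutual_information(0.015625, 0.125, 0.125))  # 0.0: pair is exactly as frequent as chance
```

A positive score means the words co-occur more often than independence would predict, a score of zero means no association, and a negative score means they co-occur less often than expected.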

Steps for Implementation

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  3. Calculate Mutual Information: For each pair of adjacent words, calculate the mutual information.
  4. Calculate Cohesion Score: Sum the mutual information values for all pairs in a candidate phrase to get the cohesion score.
  5. Set Threshold for Noun Extraction: Determine a threshold for noun extraction based on the cohesion scores.

Output:

Extracted Noun Phrases: ['quick brown', 'brown fox', 'fox jumps', 'jumps lazy', 'lazy dog', 'dog .']
Cohesion Scores: {('quick', 'brown'): 2.584962500721156, ('brown', 'fox'): 2.584962500721156, ('fox', 'jumps'): 2.584962500721156, ('jumps', 'lazy'): 2.584962500721156, ('lazy', 'dog'): 2.584962500721156, ('dog', '.'): 2.584962500721156}

The output above is for the input text "The quick brown fox jumps over the lazy dog." and contains:

  • Extracted Noun Phrases: Phrases identified as noun phrases based on their high cohesion scores.
  • Cohesion Scores: A dictionary of phrases and their corresponding cohesion scores.
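The implementation steps can be sketched as follows, assuming candidate phrases are restricted to adjacent word pairs, so the cohesion score of a phrase is the mutual information of its single pair. Probabilities are estimated as relative frequencies. The function names, the toy tokens, and the threshold of 2.0 are illustrative assumptions, and this is not the code that produced the output above.

```python
import math
from collections import Counter


def cohesion_scores(tokens):
    """Mutual information (bits) for every adjacent word pair in `tokens`."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())

    scores = {}
    for (w1, w2), count in bigram.items():
        p_joint = count / n_bi      # relative frequency of the pair
        p1 = unigram[w1] / n_uni    # relative frequency of the first word
        p2 = unigram[w2] / n_uni    # relative frequency of the second word
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores


def extract_phrases(tokens, threshold):
    """Keep the pairs whose cohesion score meets the threshold."""
    return [" ".join(pair) for pair, s in cohesion_scores(tokens).items() if s >= threshold]


# Toy tokens (an assumption for illustration only).
tokens = "new york city is big new york is old big city".split()
scores = cohesion_scores(tokens)
print(scores[("new", "york")], scores[("york", "city")])
print(extract_phrases(tokens, threshold=2.0))
```

The repeated pair "new york" scores higher than "york city", but pairs involving a word seen only once can also score highly, so in practice one might combine the threshold with a minimum pair count.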

Conclusion

Unsupervised noun extraction techniques allow for effective identification of noun phrases without the need for annotated datasets. By utilizing methods such as branching entropy, accessor variety, and cohesion score, we can detect nouns based on contextual usage. Each technique offers unique insights: branching entropy evaluates contextual unpredictability, accessor variety measures context diversity, and cohesion score assesses word associations. These methods collectively enhance our ability to identify and analyze noun phrases in large texts.
