Lemmatization with NLTK

Last Updated : 28 May, 2026

Lemmatization is a text preprocessing technique in Natural Language Processing (NLP) that converts words into their base or dictionary form called a lemma. Unlike stemming, it considers the meaning and part of speech of words, making the output more accurate and meaningful.

Lemmatization Techniques

There are several techniques for performing lemmatization, each with its own advantages and use cases:

1. Rule-Based Lemmatization

In rule-based lemmatization, predefined rules are applied to a word to remove suffixes and get the root form. This approach works well for regular words but may not handle irregularities well.

For example:

Rule: For regular verbs ending in "-ed," remove the "-ed" suffix.
Example: "walked" -> "walk"

While this method is simple and interpretable, it doesn't account for irregular word forms like "better", which should be lemmatized to "good".
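The suffix rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production lemmatizer: the rule table `SUFFIX_RULES` and the function `rule_based_lemma` are hypothetical names introduced here, and the minimum stem length of three characters is an assumed safeguard.

```python
# Hypothetical suffix rules for illustration: (suffix, replacement), checked in order.
SUFFIX_RULES = [
    ("ies", "y"),  # "studies" -> "study"
    ("ed", ""),    # "walked"  -> "walk"
    ("ing", ""),   # "walking" -> "walk"
    ("s", ""),     # "cats"    -> "cat"
]


def rule_based_lemma(word: str) -> str:
    """Apply the first matching suffix rule; return the lowercased word otherwise."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(lowered) - len(suffix)
        # Assumed guard: keep at least 3 characters so short words like "red" survive.
        if lowered.endswith(suffix) and stem_length >= 3:
            return lowered[:stem_length] + replacement
    return lowered


for word in ["walked", "studies", "cats", "running", "better", "went"]:
    print(word, "->", rule_based_lemma(word))
```

Regular forms such as "walked" come out as "walk", but "better" and "went" pass through unchanged and "running" becomes "runn", which shows why pure suffix rules struggle with irregular forms and doubled consonants.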

2. Dictionary-Based Lemmatization

It uses a predefined dictionary or lexicon such as WordNet to look up the base form of a word. This method is more accurate than rule-based lemmatization because it accounts for exceptions and irregular words.

For example:

'running' -> 'run'
'better' -> 'good'
'went' -> 'go'
"I was running to become a better athlete and then I went home." -> "I be run to become a good athlete and then I go home."

By using dictionaries like WordNet, this method can handle a wide range of words effectively, especially in languages with well-established dictionaries.
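The lookup idea can be illustrated with a toy lexicon. The sketch below assumes a small hand-written dictionary (`LEMMA_LEXICON`, a hypothetical name) in place of a real resource such as WordNet, and falls back to the input word when no entry exists.

```python
# Toy lexicon for illustration only; a real system would query a resource like WordNet.
LEMMA_LEXICON = {
    "running": "run",
    "better": "good",
    "went": "go",
}


def dictionary_lemma(word: str) -> str:
    """Look the word up in the lexicon; fall back to the word itself if it is absent."""
    return LEMMA_LEXICON.get(word.lower(), word)


for word in ["running", "better", "went", "athlete"]:
    print(word, "->", dictionary_lemma(word))
```

Because irregular forms are stored explicitly, "better" and "went" resolve correctly here, while a word missing from the lexicon ("athlete") is returned as-is. Coverage therefore depends entirely on the size and quality of the dictionary.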

3. Machine Learning-Based Lemmatization

This approach uses models trained on large annotated datasets to predict the base form of a word automatically. It is flexible and can often handle irregular words and linguistic nuances that rule-based and dictionary-based methods miss.

For example:

A trained model may deduce that "went" corresponds to "go" even though the suffix removal rule doesn't apply. Similarly, for "happier" the model deduces "happy" as the lemma.

Machine learning-based lemmatizers are more adaptive and can generalize across different word forms, which makes them well suited to complex tasks involving diverse vocabularies.
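To make the idea of learning from examples concrete, here is a deliberately tiny, data-driven sketch. It is far simpler than a real trained model: it only counts suffix rewrites seen in a hand-made training set and memorises forms that share no prefix with their lemma. The training pairs and all function names (`suffix_rule`, `train`, `predict`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set (hypothetical); real lemmatizers train on large corpora.
TRAINING_PAIRS = [
    ("happier", "happy"), ("easier", "easy"), ("funnier", "funny"),
    ("walked", "walk"), ("jumped", "jump"), ("talked", "talk"),
    ("went", "go"),
]


def suffix_rule(word, lemma):
    """Return (word_suffix, lemma_suffix) after removing the longest common prefix."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def train(pairs):
    """Learn the most frequent replacement for each observed word suffix."""
    counts = defaultdict(Counter)
    exceptions = {}
    for word, lemma in pairs:
        old, new = suffix_rule(word, lemma)
        if old == word:
            exceptions[word] = lemma  # no shared prefix: memorise as irregular
        else:
            counts[old][new] += 1
    rules = {old: c.most_common(1)[0][0] for old, c in counts.items()}
    return rules, exceptions


def predict(word, rules, exceptions):
    """Use a memorised exception if present, else the longest matching learned suffix."""
    if word in exceptions:
        return exceptions[word]
    for old in sorted(rules, key=len, reverse=True):
        if old and word.endswith(old):
            return word[: len(word) - len(old)] + rules[old]
    return word


rules, exceptions = train(TRAINING_PAIRS)
for word in ["luckier", "played", "went", "cat"]:
    print(word, "->", predict(word, rules, exceptions))
```

The words "luckier" and "played" never appear in the training data, yet the learned rewrites "ier" -> "y" and "ed" -> "" generalize to them. Real systems replace this counting step with statistical or neural models that also use context.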

Implementation of Lemmatization in Python

Let's see step by step how lemmatization works in Python:

Step 1: Installing NLTK and Downloading Necessary Resources

In Python, the NLTK library provides an easy and efficient way to implement lemmatization. First, we need to install the NLTK library and download the necessary datasets like WordNet and the punkt tokenizer.

Now let's import the library and download the necessary datasets.
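A minimal setup sketch for this step follows. The exact resource names are an assumption that depends on the installed NLTK version: recent releases look for "punkt_tab" and "averaged_perceptron_tagger_eng", while older ones use "punkt" and "averaged_perceptron_tagger", so the list (`RESOURCES`, a name introduced here) requests both.

```python
# Install NLTK first (run this in a shell, not inside Python):
#   pip install nltk

import nltk

# Resource names vary across NLTK versions, so both old and new names are requested.
RESOURCES = [
    "wordnet",                         # lexical database used by WordNetLemmatizer
    "omw-1.4",                         # multilingual WordNet data needed by some versions
    "punkt",                           # tokenizer models (older NLTK versions)
    "punkt_tab",                       # tokenizer models (newer NLTK versions)
    "averaged_perceptron_tagger",      # POS tagger (older NLTK versions)
    "averaged_perceptron_tagger_eng",  # POS tagger (newer NLTK versions)
]

for name in RESOURCES:
    # download() returns False instead of raising when a name is unavailable.
    nltk.download(name, quiet=True)
```

Downloads are cached locally, so running this block again is cheap once the data is in place.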

Step 2: Lemmatizing Text with NLTK

Now we can tokenize the text and apply lemmatization using NLTK's WordNetLemmatizer.

Output:

[Figure: Lemmatizing Text with NLTK]

In this output, we can see that:

"cats" is reduced to its lemma "cat" (noun).
"running" remains "running" (since no POS tag is provided, NLTK doesn't convert it to "run").

Step 3: Improving Lemmatization with Part of Speech (POS) Tagging

To improve the accuracy of lemmatization, it's important to specify the correct Part of Speech (POS) for each word. By default, NLTK's WordNetLemmatizer treats every word as a noun when no POS tag is provided, which is why verbs and adjectives can pass through unchanged.

For example:

"running" (as a verb) should be lemmatized to "run".
"better" (as an adjective) should be lemmatized to "good".

Output:

[Figure: Improving Lemmatization with POS Tagging]

In this improved version:

"children" is lemmatized to "child" (noun).
"running" is lemmatized to "run" (verb).
"better" is lemmatized to "good" (adjective).

Advantages

Reduces the number of unique words in a dataset
Can lower memory use and computation by shrinking the vocabulary
Improves search and information retrieval by matching different forms of the same word
Makes text data more consistent for NLP models
Can improve model accuracy by grouping inflected forms under a single lemma

Disadvantages

Slower than stemming because of dictionary and grammar analysis
Not ideal for real-time applications requiring fast processing
Can produce ambiguous results for words with multiple meanings
Requires more computational resources compared to simpler techniques

URL: https://www.geeksforgeeks.org/python/python-lemmatization-with-nltk/