Python | Stemming words with NLTK

Last Updated : 11 Jul, 2025

Stemming is the process of reducing inflected or derived words to a common root form, called the stem. Programs that do this are commonly referred to as stemming algorithms or stemmers. For example, a stemmer aims to reduce “chocolates” and “chocolatey” to the root “chocolate”, and “retrieval”, “retrieved”, and “retrieves” to a common stem for “retrieve”. The stem need not be a dictionary word: the Porter stemmer, for instance, produces “retriev” for all three.

Prerequisite: Introduction to Stemming

Some more examples of words that stem to the root word "like":

-> "likes"
-> "liked"
-> "likely"
-> "liking"
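The "like" family above can be checked directly. This is a minimal sketch, assuming NLTK is installed and using its PorterStemmer class; the variable names are illustrative only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected and derived forms of "like" from the list above
variants = ["likes", "liked", "likely", "liking"]

# Map each variant to the stem produced by the Porter algorithm
stems = {word: stemmer.stem(word) for word in variants}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Each of the four variants should come out as the same stem, "like".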

Errors in Stemming: There are two main kinds of error in stemming: overstemming and understemming. Overstemming occurs when two words that have different stems are reduced to the same root (a false positive). Understemming occurs when two words that share a stem are not reduced to the same root (a false negative).
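Both kinds of error can be observed with a real stemmer. The sketch below assumes NLTK's PorterStemmer; the word groups are commonly cited illustrations chosen for this example, and stems_of is a hypothetical helper, not part of NLTK.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stems_of(words):
    """Return the set of distinct stems the Porter stemmer produces for words."""
    return {stemmer.stem(w) for w in words}

# Overstemming candidate: three words with different meanings
over = stems_of(["universal", "university", "universe"])

# Understemming candidate: two forms of the same word
under = stems_of(["alumnus", "alumni"])

print("overstemming example:", over)    # one shared stem means the words were conflated
print("understemming example:", under)  # more than one stem means a match was missed
```

A single stem for the first group shows unrelated words being merged, while two stems for the second group show related forms being kept apart.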

Applications of stemming are:

Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
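To make the search engine use case concrete, here is a toy sketch of stem-based matching. It assumes NLTK's PorterStemmer; the documents dictionary and the stem_tokens and search helpers are hypothetical and stand in for a real index.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    """Lowercase, split on whitespace, and stem each token."""
    return {stemmer.stem(token) for token in text.lower().split()}

# Hypothetical mini-collection of documents
documents = {
    "doc1": "retrieving documents from an index",
    "doc2": "programming languages and compilers",
    "doc3": "the system retrieves matching pages",
}

def search(query):
    """Return ids of documents sharing at least one stem with the query."""
    query_stems = stem_tokens(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if query_stems & stem_tokens(text)
    )

print(search("retrieved"))
```

Because the query and the documents are stemmed the same way, the query "retrieved" matches documents that contain "retrieving" or "retrieves" even though the exact word never appears in them.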

Stemming is desirable because it reduces redundancy: most of the time, a word stem and its inflected or derived forms carry the same meaning.
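The reduction in redundancy can be seen by comparing vocabulary sizes before and after stemming. This is a small sketch assuming NLTK's PorterStemmer; the token list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical token list containing several forms of the same words
tokens = ["connect", "connected", "connecting", "connection", "connections",
          "like", "likes", "liked"]

raw_vocab = set(tokens)
stemmed_vocab = {stemmer.stem(t) for t in tokens}

print(len(raw_vocab), "distinct tokens before stemming")
print(len(stemmed_vocab), "distinct stems after stemming")
```

Here eight surface forms collapse to two stems, which is the kind of vocabulary shrinkage that makes indexing and counting cheaper.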

Below is the implementation of stemming words using NLTK:

Code #1:
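The listing for this example is not reproduced here. A minimal sketch of what it likely does, assuming NLTK's PorterStemmer and taking the word list from the output listing, is:

```python
from nltk.stem import PorterStemmer

# Create a Porter stemmer instance
ps = PorterStemmer()

# Words to stem (taken from the output listing)
words = ["program", "programs", "programmer", "programming", "programmers"]

# Stem each word once and keep the results for printing
stems = [ps.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, ":", stem)
```

Note that NLTK's Porter implementation stems "programmer" and "programmers" to "programm" rather than "program", because the algorithm strips the "er" suffix without undoubling the final consonant.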

Output:

program : program
programs : program
programmer : programm
programming : program
programmers : programm

Code #2: Stemming words from sentences
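The listing for this example is not reproduced here either. The sketch below assumes NLTK's PorterStemmer and word_tokenize, and infers the input sentence from the tokens in the output listing. The whitespace fallback is an addition for environments where the tokenizer data has not been downloaded.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# Sentence inferred from the output listing (an assumption)
sentence = "Programmers program with programming languages"

try:
    # word_tokenize needs NLTK's Punkt tokenizer data to be installed
    words = word_tokenize(sentence)
except LookupError:
    # Fallback when the tokenizer data is missing: plain whitespace split
    words = sentence.split()

# Stem every token; PorterStemmer lowercases its input by default
stems = [ps.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, ":", stem)
```

The stems are not always dictionary words: with the Porter algorithm, "languages" becomes "languag" and "Programmers" becomes "programm".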

Output:

Programmers : programm
program : program
with : with
programming : program
languages : languag

Code #3: Using reduce():

Algorithm :

Import the necessary modules: PorterStemmer and word_tokenize from nltk, and reduce from functools.
Create an instance of the PorterStemmer class.
Define a sample sentence to be stemmed.
Tokenize the sentence into individual words using word_tokenize.
Use reduce to apply the PorterStemmer to each word in the tokenized sentence, and join the stemmed words back into a string.
Print the stemmed sentence.

Install NLTK first if it is not already available: pip install nltk
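The steps above can be sketched as follows. This assumes NLTK's PorterStemmer and word_tokenize together with functools.reduce; the sample sentence is an assumption, and the whitespace fallback is added for environments without the tokenizer data.

```python
from functools import reduce

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# Sample sentence (an assumption; any sentence works)
sentence = "Programmers program with programming languages"

try:
    # word_tokenize needs NLTK's Punkt tokenizer data to be installed
    words = word_tokenize(sentence)
except LookupError:
    # Fallback when the tokenizer data is missing: plain whitespace split
    words = sentence.split()

# Fold the tokens into one string, stemming each token as it is appended
stemmed_sentence = reduce(lambda acc, w: acc + " " + ps.stem(w), words, "").strip()

print(stemmed_sentence)
```

The same result can be written more simply as " ".join(ps.stem(w) for w in words), which also avoids building the string by repeated concatenation.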

Output:

programm program with program languag

Time complexity:
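The time complexity of this code is roughly O(n), where n is the number of tokens in the input sentence. The tokenizer and the stemmer each process every token once, and reduce calls its function once per token, in sequence. It does not process elements in pairs, so it adds no logarithmic factor. One caveat: building the result by repeated string concatenation can copy the accumulated string at every step, which is quadratic in the output length in the worst case, so " ".join is the usual choice for long inputs.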

Space complexity:
The space complexity of this code is O(n), where n is the length of the input sentence. The token list and the final stemmed string both grow linearly with the input. The stemmer itself does not increase the space complexity significantly.

URL: https://www.geeksforgeeks.org/machine-learning/python-stemming-words-with-nltk/