Text Preprocessing in Python | Set 2

Last Updated : 11 Jul, 2025

Text preprocessing is one of the first steps in Natural Language Processing (NLP). It involves cleaning raw text and transforming it into a form suitable for further processing. It improves the quality of the text, makes it easier to work with and can improve the performance of machine learning models.

In this article, we will look at some more advanced text preprocessing techniques.

Prerequisites

Before starting with this article, you should go through Text Preprocessing in Python | Set 1.

To learn more about Natural Language Processing, also refer to the article Introduction to NLP.

Set 1 covered the basic preprocessing steps for textual data. The techniques below build on those steps to gain more insight into the data that we have. The example outputs in this article are in the format produced by the NLTK library.

Part of Speech Tagging

A part of speech describes how a word is used in a sentence. The same word can play different roles and carry different meanings depending on its context. Basic natural language processing models like bag-of-words ignore word order and so fail to identify these relations between words. Hence, we use part of speech tagging to label each word with its part of speech tag based on its context in the data. The tags are also used to extract relationships between words.
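As a minimal sketch, tagging with NLTK might look like the following. The helper name `pos_tagging` and the list of resource identifiers are assumptions made for illustration. Resource names differ between NLTK releases, so both the older and the newer names are requested and unknown ones are skipped.

```python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Assumption: resource names vary by NLTK version, so both variants are listed.
RESOURCES = [
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
]


def pos_tagging(text):
    """Tokenize the text and return a list of (word, tag) tuples."""
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)


if __name__ == "__main__":
    # nltk.download returns False (without raising) for an unknown name.
    for name in RESOURCES:
        nltk.download(name, quiet=True)
    print(pos_tagging("You just gave me a scare"))
```

The tokenizer and tagger data are downloaded only when the script is run directly, so importing the helper elsewhere does not trigger network access.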

Output:

[('You', 'PRP'),
 ('just', 'RB'),
 ('gave', 'VBD'),
 ('me', 'PRP'),
 ('a', 'DT'),
 ('scare', 'NN')]

In the given example, PRP stands for personal pronoun, RB for adverb, VBD for verb past tense, DT for determiner and NN for noun. We can get the details of all the part of speech tags using the Penn Treebank tagset.
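One way to look up a tag is NLTK's built-in help for the Penn Treebank tagset. This is a sketch under the assumption that the tagset help data can be downloaded. The wrapper name `describe_tag` is hypothetical, and the resource is named `tagsets` in older NLTK releases and `tagsets_json` in newer ones.

```python
import nltk
from nltk.help import upenn_tagset


def describe_tag(tag):
    """Print the Penn Treebank description and example words for a tag."""
    upenn_tagset(tag)


if __name__ == "__main__":
    # Assumption: only one of these names exists in a given NLTK version.
    for name in ("tagsets", "tagsets_json"):
        nltk.download(name, quiet=True)
    describe_tag("NN")
```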

Output:

NN: noun, common, singular or mass
 common-carrier cabbage knuckle-duster Casino afghan shed thermostat
 investment slide humour falloff slick wind hyena override subhumanity
 machinist ...

Chunking

Chunking is the process of extracting phrases from unstructured text and adding more structure to it. It is also known as shallow parsing. It is done on top of part of speech tagging and groups words into "chunks", mainly noun phrases. The chunks are defined using regular expressions over part of speech tags.
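The sketch below shows regular-expression chunking with NLTK's `RegexpParser`. To keep it self-contained, the sentence is supplied already tagged, where in practice the tags would come from a part of speech tagger. The helper name `chunking` and the variable names are assumptions, and the grammar string is one way to write a rule of "optional determiner, any number of adjectives, then a noun".

```python
from nltk import RegexpParser

# Pre-tagged tokens as (word, POS tag) pairs; normally produced by a tagger.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
    ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
    ("the", "DT"), ("sky", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"


def chunking(tagged_tokens, grammar):
    """Parse tagged tokens with a regular-expression chunk grammar."""
    chunk_parser = RegexpParser(grammar)
    return chunk_parser.parse(tagged_tokens)


tree = chunking(tagged_sentence, grammar)

# subtrees() yields the whole sentence tree first, then each NP chunk.
for subtree in tree.subtrees():
    print(subtree)

# Collect only the noun phrases as plain strings.
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
```

Because the chunker works only on the tags, this part needs no downloaded NLTK data.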

Output:

(S
 (NP the/DT little/JJ yellow/JJ bird/NN)
 is/VBZ
 flying/VBG
 in/IN
 (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)

In the given example, the grammar is defined using a single regular expression rule. This rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Libraries like spaCy and TextBlob can be better suited for chunking.

Example:

Input: 'the little yellow bird is flying in the sky'

Output: (S
(NP the/DT little/JJ yellow/JJ bird/NN)
is/VBZ
flying/VBG
in/IN
(NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)

Named Entity Recognition

Named Entity Recognition (NER) is used to extract information from unstructured text. It classifies the entities present in a text into categories such as person, organization, event and place. This gives us detailed knowledge about the text and the relationships between the different entities.
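A minimal sketch of entity recognition with NLTK's `ne_chunk` is given below. It assumes the tokenizer, tagger and named entity chunker resources can be downloaded. The helper name `named_entity_recognition` is an assumption, and the resource identifiers vary by NLTK version, so both older and newer names are listed.

```python
import nltk
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

# Assumption: resource names vary by NLTK version; unknown names are skipped.
RESOURCES = [
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker", "maxent_ne_chunker_tab",
    "words",
]


def named_entity_recognition(text):
    """Tokenize, tag and chunk the text into a tree of named entities."""
    word_tokens = word_tokenize(text)
    word_pos = pos_tag(word_tokens)
    return ne_chunk(word_pos)


if __name__ == "__main__":
    for name in RESOURCES:
        nltk.download(name, quiet=True)
    sentence = "Bill works for GeeksforGeeks so he went to Delhi for a meetup."
    print(named_entity_recognition(sentence))
```

The result is a tree in which recognized entities appear as labeled subtrees such as PERSON, ORGANIZATION or GPE, and all other tokens remain as (word, tag) leaves.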

Example:

Input: 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'

Output: (S
(PERSON Bill/NNP)
works/VBZ
for/IN
(ORGANIZATION GeeksforGeeks/NNP)
so/RB
he/PRP
went/VBD
to/TO
(GPE Delhi/NNP)
for/IN
a/DT
meetup/NN
./.)

Conclusion

In conclusion, natural language processing (NLP) plays a pivotal role in bridging the gap between human communication and computer understanding. As this field progresses, we can anticipate further innovations that will reshape how we communicate with and leverage the capabilities of intelligent systems in our daily lives and professional endeavors.
