Text processing is a key component of Natural Language Processing (NLP). It helps us clean and convert raw text data into a format suitable for analysis and machine learning.
Below are some common text preprocessing techniques in Python.
We convert the text to lowercase to reduce the size of the vocabulary, so that words such as "Amazing" and "amazing" are treated as the same token.
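A minimal sketch of this step using Python's built-in `str.lower`. The function name `to_lowercase` and the sample sentence are assumptions made for illustration.

```python
def to_lowercase(text: str) -> str:
    """Return a lowercased copy of the input text."""
    return text.lower()


# Sample sentence (assumed for illustration).
sample = ("Hey, did you know that the summer break is coming? "
          "Amazing right !! It's only 5 more days !!")
print(to_lowercase(sample))
```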
Output
hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!
We can either remove numbers or convert the numbers into their textual representations. To remove the numbers we can use regular expressions.
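One way to remove numbers with a regular expression is sketched here. The helper name `remove_numbers` and the sample sentence are illustrative assumptions, and the pattern `\d+` only targets runs of digits.

```python
import re


def remove_numbers(text: str) -> str:
    """Delete every run of digits from the text."""
    return re.sub(r"\d+", "", text)


# Sample sentence (assumed for illustration).
sample = "There are 3 balls in this bag, and 12 in the other one."
print(remove_numbers(sample))
```

Deleting digits leaves the surrounding spaces in place, so the result can contain doubled spaces until the whitespace is normalized.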
Output
There are balls in this bag, and in the other one.
We can also convert the numbers into words. This can be done by using the inflect library.
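A possible sketch of this conversion follows. It assumes the third-party `inflect` package is installed and uses its `engine().number_to_words` method. The helper name `numbers_to_words` and its optional `to_words` parameter are hypothetical, added so the replacement logic can be tried with any int-to-string function.

```python
import re


def numbers_to_words(text, to_words=None):
    """Replace each run of digits in `text` with its spelled-out form.

    `to_words` maps an int to a string. When omitted, inflect's
    number_to_words is used (requires `pip install inflect`).
    """
    if to_words is None:
        import inflect  # third-party; imported lazily on first use
        to_words = inflect.engine().number_to_words
    # Convert each matched digit run to an int, then to words.
    return re.sub(r"\d+", lambda m: to_words(int(m.group())), text)


# With inflect installed, the default converter can be used directly:
# numbers_to_words("There are 3 balls in this bag, and 12 in the other one.")
```

Because the pattern matches runs of digits, a decimal such as 3.5 would be converted as two separate integers in this sketch.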
Output
There are three balls in this bag and twelve in the other one.
We remove punctuation so that the same word does not appear in several forms. For example, if punctuation is kept, "been.", "been," and "been!" are treated as three separate tokens.
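A minimal sketch of punctuation removal using only the standard library. The helper name `remove_punctuation` and the sample sentence are assumptions. Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would need extra handling.

```python
import string


def remove_punctuation(text: str) -> str:
    """Strip all ASCII punctuation characters from the text."""
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", string.punctuation))


# Sample sentence (assumed for illustration).
sample = ("Hey, did you know that the summer break is coming? "
          "Amazing right !! It's only 5 more days !!")
print(remove_punctuation(sample))
```

Spaces that surrounded the removed characters are kept, so the result may still contain repeated or trailing spaces.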
Output
Hey did you know that the summer break is coming Amazing right Its only 5 more days
We can use the split and join string methods to remove extra whitespace. Leading, trailing and repeated whitespace is collapsed so that words are separated by a single space.
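The split-then-join idiom can be sketched as follows. The helper name `normalize_whitespace` and the sample string are illustrative assumptions.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim both ends of the text."""
    # split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Sample string with irregular spacing (assumed for illustration).
sample = "   we don't need   the given questions  "
print(normalize_whitespace(sample))
```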
Output
we don't need the given questions
Stopwords are words that do not contribute much to the meaning of a sentence, so they can be removed. The NLTK library ships a set of stopwords that we can use to filter them out of our text. The English list is available through stopwords.words('english').
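A hedged sketch of stopword filtering follows. It assumes NLTK is installed and that the stopword corpus has been fetched once with `nltk.download("stopwords")`. The helper names `tokenize` and `remove_stopwords`, the simple regex tokenizer, and the optional `stop_words` parameter are assumptions made for illustration and stand in for whatever tokenizer and word list a real pipeline uses.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens (simple regex stand-in)."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stopwords(text, stop_words=None):
    """Return the tokens of `text` that are not in `stop_words`."""
    if stop_words is None:
        # Requires NLTK and a one-time nltk.download("stopwords").
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))
    # Compare in lowercase so that "This" and "this" are both filtered.
    return [tok for tok in tokenize(text) if tok.lower() not in stop_words]


# With NLTK and its stopword corpus available:
# remove_stopwords("This is a sample sentence and we are going to remove the stopwords from this.")
```

Punctuation tokens are kept because they are not part of the stopword list.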
Output
['sample', 'sentence', 'going', 'remove', 'stopwords', '.']
Stemming is the process of reducing a word to its stem, or root form. The stem is the part of the word to which affixes such as -ed, -ize, de- and -s are attached, and it is obtained by stripping a prefix or suffix from the word. The result is not always a dictionary word, as the last two examples below show.
Examples:
books --> book
looked --> look
denied --> deni
flies --> fli
There are mainly three algorithms for stemming. These are the Porter Stemmer, the Snowball Stemmer and the Lancaster Stemmer. Porter Stemmer is the most common among them.
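Stemming with the Porter Stemmer might look like the sketch below. It assumes NLTK is installed and uses `nltk.stem.PorterStemmer`. The helper name `stem_words`, the whitespace tokenization, the optional `stem` parameter and the sample sentence are illustrative assumptions.

```python
def stem_words(text, stem=None):
    """Return the stem of every whitespace-separated token in `text`.

    `stem` is any function mapping a word to its stem. When omitted,
    NLTK's PorterStemmer is used.
    """
    if stem is None:
        from nltk.stem import PorterStemmer  # requires NLTK to be installed
        stem = PorterStemmer().stem
    return [stem(token) for token in text.split()]


# With NLTK installed (sample sentence assumed for illustration):
# stem_words("data science uses scientific methods algorithms and many types of processes")
```

Another stemmer can be swapped in by passing its `stem` method, for example `SnowballStemmer("english").stem`.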
Output
['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of', 'process']
Lemmatization is an NLP technique that reduces a word to its dictionary form, called the lemma. Unlike a stem, a lemma is normally a valid word. This is helpful for tasks such as text analysis and search because it lets us compare words that are related but have different forms.
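A possible sketch of lemmatization, assuming NLTK is installed and the WordNet data has been fetched once with `nltk.download("wordnet")`. It relies on `nltk.stem.WordNetLemmatizer`. The helper name `lemmatize_words`, the whitespace tokenization and the optional `lemmatize` parameter are assumptions for illustration. The exact lemmas depend on the part of speech passed in `pos`.

```python
def lemmatize_words(text, lemmatize=None, pos="n"):
    """Return the lemma of every whitespace-separated token in `text`.

    `lemmatize` is any function taking (word, pos). When omitted,
    NLTK's WordNetLemmatizer is used. `pos` is a WordNet tag such as
    "n" (noun, the library default) or "v" (verb).
    """
    if lemmatize is None:
        # Requires NLTK and a one-time nltk.download("wordnet").
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize
    return [lemmatize(token, pos) for token in text.split()]


# With NLTK and WordNet available:
# lemmatize_words("data science uses scientific methods algorithms", pos="v")
```

Treating every token as the same part of speech is a simplification. In practice the tag for each word would come from a POS tagger.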
Output
['data', 'science', 'use', 'scientific', 'method', 'algorithm']
Part-of-speech (POS) tagging assigns each word its grammatical role, such as noun, verb or adjective. This helps machines understand sentence structure and meaning for tasks like parsing, information extraction and text analysis.
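POS tagging could be sketched as follows, assuming NLTK is installed together with its tagger model (the resource name to download differs between NLTK versions). The sketch calls `nltk.pos_tag`. The helper name `pos_tag_text`, the regex tokenizer and the optional `tagger` parameter are illustrative assumptions.

```python
import re


def pos_tag_text(text, tagger=None):
    """Tokenize `text` and return a list of (token, tag) pairs.

    `tagger` is any function mapping a token list to (token, tag)
    pairs. When omitted, nltk.pos_tag is used.
    """
    # Simple regex tokenizer: words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if tagger is None:
        # Requires NLTK and its POS tagger model to be downloaded.
        from nltk import pos_tag
        tagger = pos_tag
    return tagger(tokens)


# With NLTK and its tagger model available (sample sentence assumed):
# pos_tag_text("Data science combines statistics, programming, and machine learning.")
```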
Output
[('Data', 'NNP'), ('science', 'NN'), ('combines', 'VBZ'),
('statistics', 'NNS'), (',', ','), ('programming', 'NN'),
(',', ','), ('and', 'CC'), ('machine', 'NN'), ('learning', 'NN'), ('.', '.')]
POS Tags Reference: