Raw text data is often unstructured, noisy and inconsistent, containing typos, punctuation, stopwords and irrelevant information. Text preprocessing converts this data into a clean, structured and standardized format, enabling effective feature extraction and improving model performance.
Here we implement text preprocessing techniques in Python, showing how raw text is cleaned, transformed and prepared for NLP tasks.
Here we define a sample corpus containing a variety of text examples, including HTML tags, emojis, URLs, numbers, punctuation and typos. This corpus will be used to demonstrate each preprocessing step in detail.
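The raw strings can be recovered from the outputs shown later in this article, since the emoji-converted and contraction-expanded lists preserve them. A minimal sketch of the corpus, assuming the variable name `corpus` (hypothetical) and writing the emoji as a Unicode escape:

```python
# Sample corpus reconstructed from the outputs shown later in the article.
# The variable name `corpus` is an assumption, not taken from the source.
corpus = [
    "I can't wait for the new season of my favorite show! \U0001F60D",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??",
    "Check out https://www.example.com for more info!",
    "He won 1st prize in the comp3tition!!!",
    "I luvv this movie sooo much!!!",
]

print(len(corpus), "documents")
```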
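Text cleaning is the process of removing noise and unwanted elements from raw text to make it structured and easier for NLP models to analyze. Regular expressions (regex) are a useful tool in text preprocessing: they let you find, match and manipulate patterns in text efficiently. As the output shows, the cleaning here lowercases the text and strips HTML tags, digits, punctuation and emojis. URLs are not removed as such, so the address collapses into the single token httpswwwexamplecom.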
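One way to get this kind of cleaning with Python's built-in `re` module is sketched below. The function name `clean_text` and the exact patterns are assumptions chosen to match the cleaned output that follows. They are not necessarily the original code.

```python
import re

def clean_text(text):
    """Lowercase, strip HTML tags, keep only letters and single spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", "", text)        # drop HTML tags
    text = re.sub(r"[^a-z\s]", "", text)       # drop digits, punctuation, emojis
    text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace
    return text

samples = [
    "<html><body>Welcome to the website!</body></html>",
    "He won 1st prize in the comp3tition!!!",
    "Check out https://www.example.com for more info!",
]
print([clean_text(s) for s in samples])
```

To discard URLs entirely rather than collapsing them into one token, a pattern such as `https?://\S+` could be removed before the punctuation step.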
Output:
Cleaned Corpus:
['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language', 'check out httpswwwexamplecom for more info', 'he won st prize in the comptition', 'i luvv this movie sooo much']
Tokenization is the process of breaking text into smaller units, such as words or sentences. This step converts raw text into a structured format that NLP models can analyze and process.
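Once punctuation has been stripped, splitting on whitespace is enough for word-level tokenization. The sketch below assumes already-cleaned input and uses a hypothetical helper named `tokenize`. A library tokenizer would be the usual choice for raw text that still contains punctuation.

```python
def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()

cleaned = [
    "i cant wait for the new season of my favorite show",
    "welcome to the website",
]
tokenized = [tokenize(sentence) for sentence in cleaned]
print(tokenized)
```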
Output:
Tokenized Corpus:
[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', 'httpswwwexamplecom', 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'comptition'], ['i', 'luvv', 'this', 'movie', 'sooo', 'much']]
Stopwords are common words in a language (like “is”, “the”, “and”) that usually do not add significant meaning to text analysis. Removing them helps NLP models focus on the more meaningful words in the text.
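Stopword removal amounts to filtering tokens against a set. The sketch below uses a small hand-picked set, `STOPWORDS`, which is an illustrative assumption. NLP libraries ship much larger curated lists, and the output that follows was produced with such a list rather than this one.

```python
# A tiny illustrative stopword set (assumption); real lists are far larger.
STOPWORDS = {"i", "a", "is", "the", "of", "for", "to", "my", "on", "in", "and", "this"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = ["i", "cant", "wait", "for", "the", "new", "season", "of", "my", "favorite", "show"]
print(remove_stopwords(tokens))
```

Using a set rather than a list keeps each membership check constant-time, which matters on large corpora.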
Output:
Stopword Removed Corpus:
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
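Stemming is the process of reducing words to a root form by stripping suffixes. It helps normalize text by treating different forms of a word (e.g., “running”, “runs”) as the same word (“run”). The resulting stem is not always a dictionary word: in the output below, “favorite” becomes “favorit” and “people” becomes “peopl”.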
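To show the idea of suffix stripping, here is a deliberately crude stemmer. It is a toy sketch and not the algorithm that produced the output below, so it will not reproduce those stems. The function name `crude_stem` and its suffix list are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "stocks", "affected"]:
    print(word, "->", crude_stem(word))
```

Established stemmers apply many more rules and conditions than this, which is why they handle cases such as “rising” or “news” more sensibly than a single-pass suffix strip.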
Output:
Stemmed Corpus:
[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['welcom', 'websit'], ['python', 'great', 'program', 'languag'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptit'], ['luvv', 'movi', 'sooo', 'much']]
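Lemmatization is the process of converting a word to its meaningful base or dictionary form, called a lemma. Unlike stemming, it aims to return an actual word in the language. Results depend on the part-of-speech information supplied: in the output below, “millions” becomes “million”, but “affected” and “rising” are left unchanged and “us” is reduced to “u”.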
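At its core, lemmatization is a dictionary lookup rather than string surgery. The sketch below makes that concrete with a tiny hand-written table, `LEMMA_TABLE`, which is purely illustrative. Real lemmatizers rely on a full lexical database and usually on part-of-speech tags as well.

```python
# Tiny illustrative lemma table (assumption); real lemmatizers use a lexicon.
LEMMA_TABLE = {
    "millions": "million",
    "stocks": "stock",
    "studies": "study",
    "mice": "mouse",
}

def lemmatize(token):
    """Return the dictionary form if known, otherwise the token unchanged."""
    return LEMMA_TABLE.get(token, token)

tokens = ["covid", "pandemic", "affected", "millions", "people", "worldwide"]
print([lemmatize(tok) for tok in tokens])
```

Irregular forms such as “mice” show why a lookup is needed: no suffix rule would map it to “mouse”.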
Output:
Lemmatized Corpus:
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
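Contraction expansion is the process of converting shortened forms of words (like “can’t”, “won’t”) into their full forms (“cannot”, “will not”). This helps NLP models better understand the meaning of the text. It has to run on the raw text, before punctuation is stripped, because cleaning turns “can’t” into “cant”. Note that in the output below the expansion step also rewrote “U.S.” as “YOU.S.”, an unwanted side effect worth checking for.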
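A lookup table plus one regular expression is enough to sketch the technique. The table `CONTRACTIONS` below is a small illustrative assumption, not a complete list, and the function name `expand_contractions` is hypothetical.

```python
import re

# Small illustrative table (assumption); a real list has many more entries.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
    "i'm": "i am",
}
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def _replace(match):
    word = match.group(0)
    expanded = CONTRACTIONS[word.lower()]
    # Preserve a leading capital, e.g. "Won't" -> "Will not".
    if word[0].isupper():
        expanded = expanded[0].upper() + expanded[1:]
    return expanded

def expand_contractions(text):
    """Expand known contractions; all other text is left untouched."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return PATTERN.sub(_replace, text)

print(expand_contractions("I can't wait for the new season of my favorite show!"))
print(expand_contractions("U.S. stocks fell on Friday."))
```

Because the table contains only genuine contractions, a string such as “U.S.” passes through unchanged.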
Output:
Expanded Corpus:
['I cannot wait for the new season of my favorite show! 😍', 'The COVID-19 pandemic has affected millions of people worldwide.', 'YOU.S. stocks fell on Friday after news of rising inflation.', '<html><body>Welcome to the website!</body></html>', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
Emoji conversion is the process of converting emojis in text into descriptive text labels. This allows NLP models to understand the meaning conveyed by emojis.
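A rough standard-library approximation is sketched below. It assumes that any character in the Unicode category `So` (Symbol, other) should be replaced by its Unicode name. That heuristic also catches non-emoji symbols such as ©, and the labels come from Unicode character names, so their wording differs from the label in the output that follows. The function name `demojize` is an assumption here.

```python
import unicodedata

def demojize(text):
    """Replace symbol characters with a :unicode_name: style label."""
    pieces = []
    for char in text:
        name = unicodedata.name(char, "")
        if unicodedata.category(char) == "So" and name:
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            pieces.append(char)
    return "".join(pieces)

# The emoji is written as an escape so the file stays ASCII-only.
print(demojize("I can't wait for the new season of my favorite show! \U0001F60D"))
```

Like contraction expansion, this step belongs before cleaning, since the cleaning step deletes emojis outright.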
Output:
Emoji Converted Corpus:
["I can't wait for the new season of my favorite show! :smiling_face_with_heart-eyes:", 'The COVID-19 pandemic has affected millions of people worldwide.', 'U.S. stocks fell on Friday after news of rising inflation.', '<html><body>Welcome to the website!</body></html>', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
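Spell correction is the process of identifying and correcting misspelled words in text. This helps NLP models receive accurate and meaningful words for analysis. Automatic correction is imperfect, however: in the output below, “comptition” is correctly fixed to “competition”, but “covid” becomes “covin”, “sooo” becomes “soon”, and the URL token is replaced by None.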
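The core idea, picking the closest known word, can be sketched with the standard library's `difflib`. The vocabulary `VOCAB`, the `cutoff` value and the function name `correct` are all illustrative assumptions. A dedicated spell checker uses a large frequency dictionary instead.

```python
import difflib

# Tiny illustrative vocabulary (assumption); real checkers use large dictionaries.
VOCAB = ["competition", "love", "movie", "much", "prize", "so", "this", "won"]

def correct(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

tokens = ["he", "won", "st", "prize", "in", "the", "comptition"]
print([correct(tok) for tok in tokens])
```

Falling back to the original token when no candidate is close enough avoids placeholder values such as None, and a fairly strict cutoff limits over-correction of words the vocabulary does not cover.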
Output:
Spell Corrected Corpus:
[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covin', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', None, 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'competition'], ['i', 'luvs', 'this', 'movie', 'soon', 'much']]
POS tagging assigns grammatical labels (like noun, verb, adjective) to each word in a sentence. This helps NLP models understand the role of words and their relationships in the text.
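The simplest possible tagger is a lexicon lookup with a default tag, sketched below using Penn Treebank style labels like those in the output that follows. The `LEXICON` table and the function name `lookup_tag` are illustrative assumptions. Practical taggers are statistical models that also use the surrounding context.

```python
# Tiny illustrative lexicon (assumption) using Penn Treebank style tags.
LEXICON = {
    "python": "NN",
    "is": "VBZ",
    "a": "DT",
    "the": "DT",
    "great": "JJ",
    "programming": "NN",
    "language": "NN",
}

def lookup_tag(tokens, default="NN"):
    """Tag each token from the lexicon, falling back to a default tag."""
    return [(tok, LEXICON.get(tok, default)) for tok in tokens]

print(lookup_tag(["python", "is", "a", "great", "programming", "language"]))
```

Defaulting unknown words to a noun tag is a common baseline, which is one reason unusual tokens often end up labeled NN.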
Output:
POS Tagged Corpus:
[[('i', 'NN'), ('cant', 'VBP'), ('wait', 'NN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('season', 'NN'), ('of', 'IN'), ('my', 'PRP$'), ('favorite', 'JJ'), ('show', 'NN')], [('the', 'DT'), ('covid', 'NN'), ('pandemic', 'NN'), ('has', 'VBZ'), ('affected', 'VBN'), ('millions', 'NNS'), ('of', 'IN'), ('people', 'NNS'), ('worldwide', 'VBP')], [('us', 'PRP'), ('stocks', 'NNS'), ('fell', 'VBD'), ('on', 'IN'), ('friday', 'NN'), ('after', 'IN'), ('news', 'NN'), ('of', 'IN'), ('rising', 'VBG'), ('inflation', 'NN')], [('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('website', 'NN')], [('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('programming', 'NN'), ('language', 'NN')], [('check', 'VB'), ('out', 'RP'), ('httpswwwexamplecom', 'NN'), ('for', 'IN'), ('more', 'JJR'), ('info', 'NN')], [('he', 'PRP'), ('won', 'VBD'), ('st', 'JJ'), ('prize', 'NN'), ('in', 'IN'), ('the', 'DT'), ('comptition', 'NN')], [('i', 'NN'), ('luvv', 'VBP'), ('this', 'DT'), ('movie', 'NN'), ('sooo', 'VBZ'), ('much', 'RB')]]