In this article, we introduce the textacy module in Python, which is used to perform a variety of NLP tasks on text. It is built on top of the spaCy library.
We can install the textacy module using pip:
pip install textacy
If you use conda, run the following command instead:
conda install -c conda-forge textacy
Here we will look at some of the notable features of the textacy module.
Using the preprocessing subpackage of textacy, we can easily remove punctuation from our text.
The text used here is sample text taken from an external website. First we import the preprocessing subpackage from textacy, then call the punctuation function of its remove module. Each punctuation mark is replaced with a space, which is why a word such as lour'd comes out as lour d.
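The call described above might look like the following minimal sketch. It assumes textacy 0.11 or later, where the function is exposed as preprocessing.remove.punctuation; the except branch is only a rough standard-library stand-in (not textacy's own code) so that the snippet still runs where textacy is not installed. The sample is a single line of the passage rather than the full text.

```python
import unicodedata

try:
    # Assumes textacy >= 0.11, where this function is preprocessing.remove.punctuation.
    from textacy import preprocessing
    remove_punctuation = preprocessing.remove.punctuation
except (ImportError, AttributeError):
    # Rough stand-in (not textacy's code) so the snippet runs without textacy:
    # every Unicode punctuation character becomes a single space.
    def remove_punctuation(text):
        return "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )

# One line of the passage; the full text is omitted for brevity.
sample = "Grim-visaged war hath smooth'd his wrinkled front;"
cleaned = remove_punctuation(sample)
print(cleaned)
```

Because each mark is swapped for a space rather than deleted, hyphenated and apostrophised words are split into separate tokens.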
Output:
Now is the winter of our discontent Made glorious summer by this sun of York And all the clouds that lour d upon our house In the deep bosom of the ocean buried Now are our brows bound with victorious wreaths Our bruised arms hung up for monuments Our stern alarums changed to merry meetings Our dreadful marches to delightful measures Grim visaged war hath smooth d his wrinkled front And now instead of mounting barded steeds To fright the souls of fearful adversaries He capers nimbly in a lady s chamber To the lascivious pleasing of a lute But I that am not shaped for sportive tricks Nor made to court an amorous looking glass I that am rudely stamp d and want love s majesty To strut before a wanton ambling nymph I that am curtail d of this fair proportion
We can also remove unnecessary whitespace from our text. Every run of extra spaces is collapsed to a single space, and leading and trailing whitespace is stripped.
Here we use the whitespace function of the normalize module, which is part of the preprocessing subpackage.
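A minimal sketch of that call, assuming textacy 0.11 or later (preprocessing.normalize.whitespace). The messy sample string is made up for illustration, and the except branch is a rough standard-library approximation rather than textacy's implementation.

```python
import re

try:
    # Assumes textacy >= 0.11.
    from textacy import preprocessing
    normalize_whitespace = preprocessing.normalize.whitespace
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy.
    def normalize_whitespace(text):
        text = re.sub(r"(\r\n|\n)+", "\n", text)  # runs of line breaks -> one newline
        text = re.sub(r"[^\S\n]+", " ", text)     # runs of other whitespace -> one space
        return text.strip()                       # drop leading/trailing whitespace

# Hypothetical sample with repeated spaces and a tab.
messy = "  Now   is the    winter \t of our   discontent;  "
tidy = normalize_whitespace(messy)
print(tidy)
```

Note that the semicolon survives: this step only touches whitespace.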
Output:
Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures. Grim-visaged war hath smooth'd his wrinkled front; And now, instead of mounting barded steeds To fright the souls of fearful adversaries, He capers nimbly in a lady's chamber To the lascivious pleasing of a lute. But I, that am not shaped for sportive tricks, Nor made to court an amorous looking-glass; I, that am rudely stamp'd, and want love's majesty To strut before a wanton ambling nymph; I, that am curtail'd of this fair proportion,
In the output, we can see that all the excess whitespace has been removed, but the punctuation is still there. If we want to remove that too, we can combine both operations.
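One way to combine punctuation removal and whitespace normalization is sketched below, assuming textacy 0.11 or later, which provides preprocessing.make_pipeline for chaining functions from left to right. The order matters: punctuation removal leaves extra spaces behind, so whitespace normalization runs second. The sample line and the fallback branch are illustrative assumptions, not the article's original code.

```python
import re
import unicodedata

try:
    # Assumes textacy >= 0.11; make_pipeline applies the functions left to right.
    from textacy import preprocessing
    clean = preprocessing.make_pipeline(
        preprocessing.remove.punctuation,
        preprocessing.normalize.whitespace,
    )
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy.
    def clean(text):
        no_punct = "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )
        return re.sub(r"[^\S\n]+", " ", no_punct).strip()

# Hypothetical sample with both punctuation and repeated spaces.
line = "And now,   instead of mounting   barded steeds;"
result = clean(line)
print(result)
```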
Output:
Now is the winter of our discontent Made glorious summer by this sun of York And all the clouds that lour d upon our house In the deep bosom of the ocean buried Now are our brows bound with victorious wreaths Our bruised arms hung up for monuments Our stern alarums changed to merry meetings Our dreadful marches to delightful measures Grim visaged war hath smooth d his wrinkled front And now instead of mounting barded steeds To fright the souls of fearful adversaries He capers nimbly in a lady s chamber To the lascivious pleasing of a lute But I that am not shaped for sportive tricks Nor made to court an amorous looking glass I that am rudely stamp d and want love s majesty To strut before a wanton ambling nymph I that am curtail d of this fair proportion
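Sometimes the text we receive is 'raw', that is, unstructured and messy, so before analysis, in the preprocessing stage, we might need to clean it up and partition it based on certain criteria. Here the text is split around every occurrence of a keyword (the letter i in this example), keeping a window of surrounding characters on each side.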
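This kind of keyword-in-context partitioning can be sketched as follows. The sketch assumes textacy 0.11 or later, where extract.keyword_in_context accepts a plain string and yields (before, match, after) tuples; the keyword "i" and window_width=20 are inferred from the output shown in this article, and the except branch is a rough standard-library stand-in.

```python
import re

try:
    # Assumes textacy >= 0.11, where the function lives in textacy.extract.
    from textacy import extract
    keyword_in_context = extract.keyword_in_context
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy.
    def keyword_in_context(text, keyword, *, ignore_case=True, window_width=50):
        flags = re.IGNORECASE if ignore_case else 0
        for m in re.finditer(keyword, text, flags=flags):
            before = text[max(0, m.start() - window_width):m.start()]
            after = text[m.end():m.end() + window_width]
            yield (before, m.group(), after)

# First line of the passage only.
text = "Now is the winter of our discontent"
hits = list(keyword_in_context(text, "i", window_width=20))
for hit in hits:
    print(hit)
```

The keyword is matched anywhere in the text, including inside words, which is why a single-letter keyword produces so many hits.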
The output below looks a bit complex because the text used here is not well suited to this task. As the text had already been stripped of punctuation and extra whitespace, none of either appears in the result. The blank spaces that do appear are due to the window_width setting.
Output:
[(' Now ', 'i', 's the winter of our '),
(' Now is the w', 'i', 'nter of our disconte'),
(' the winter of our d', 'i', 'scontent\nMade glorio'),
('discontent\nMade glor', 'i', 'ous summer by this s'),
('lorious summer by th', 'i', 's sun of York \nAnd a'),
('ur d upon our house\n', 'I', 'n the deep bosom of '),
('som of the ocean bur', 'i', 'ed \nNow are our brow'),
('re our brows bound w', 'i', 'th victorious wreath'),
('r brows bound with v', 'i', 'ctorious wreaths \nOu'),
('ws bound with victor', 'i', 'ous wreaths \nOur bru'),
('ous wreaths \nOur bru', 'i', 'sed arms hung up for'),
('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'),
('adful marches to del', 'i', 'ghtful measures \nGri'),
('ightful measures \nGr', 'i', 'm visaged war hath s'),
('ful measures \nGrim v', 'i', 'saged war hath smoot'),
(' war hath smooth d h', 'i', 's wrinkled front \nAn'),
('hath smooth d his wr', 'i', 'nkled front \nAnd now'),
('kled front \nAnd now ', 'i', 'nstead of mounting b'),
('now instead of mount', 'i', 'ng barded steeds\nTo '),
(' barded steeds\nTo fr', 'i', 'ght the souls of fea'),
(' of fearful adversar', 'i', 'es \nHe capers nimbly'),
('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'),
('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'),
(' chamber\nTo the lasc', 'i', 'vious pleasing of a '),
('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'),
('the lascivious pleas', 'i', 'ng of a lute \nBut I '),
('sing of a lute \nBut ', 'I', ' that am not shaped '),
('not shaped for sport', 'i', 've tricks \nNor made '),
('aped for sportive tr', 'i', 'cks \nNor made to cou'),
('ourt an amorous look', 'i', 'ng glass \nI that am '),
('rous looking glass \n', 'I', ' that am rudely stam'),
('before a wanton ambl', 'i', 'ng nymph \nI that am '),
('nton ambling nymph \n', 'I', ' that am curtail d o'),
('mph \nI that am curta', 'i', 'l d of this fair pro'),
('t am curtail d of th', 'i', 's fair proportion '),
('curtail d of this fa', 'i', 'r proportion '),
('of this fair proport', 'i', 'on ')]
The section below shows the result if we don't remove the punctuation or whitespace first. The entire output is not included, as it is long and, with all the punctuation and whitespace still present, it would look messy.
[(' \nNow ', 'i', 's the winter of our '),
(' \nNow is the w', 'i', 'nter of our dis'),
('winter of our d', 'i', 'scontent\nMade glorio'),
('discontent\nMade glor', 'i', 'ous summer by this s'),
('lorious summer by th', 'i', 's sun of York;\nAnd a'),
("ur'd upon our house\n", 'I', 'n the deep b'),
('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
We can remove any unnecessary URLs from our text and replace them with some other text.
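A minimal sketch of URL replacement, assuming textacy 0.11 or later (preprocessing.replace.urls with a repl argument). The sample sentence, its example.com address, and the replacement text "URL" are made up for illustration; the except branch is a much simpler standard-library stand-in than textacy's own URL pattern.

```python
import re

try:
    # Assumes textacy >= 0.11.
    from textacy import preprocessing
    replace_urls = preprocessing.replace.urls
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy; only handles http(s) links.
    def replace_urls(text, repl="_URL_"):
        return re.sub(r"https?://\S+", repl, text)

# Hypothetical sample sentence.
text = "Visit https://www.example.com/docs for details"
out = replace_urls(text, repl="URL")
print(out)
```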
Numbers can be replaced in the same way. If the text contains more than one number, all of them are replaced with NUM.
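A minimal sketch of number replacement, assuming textacy 0.11 or later (preprocessing.replace.numbers). The sample sentence is made up, and the except branch is a simplified standard-library stand-in, not textacy's number pattern.

```python
import re

try:
    # Assumes textacy >= 0.11.
    from textacy import preprocessing
    replace_numbers = preprocessing.replace.numbers
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy.
    def replace_numbers(text, repl="_NUMBER_"):
        return re.sub(r"\d+(?:[.,]\d+)*", repl, text)

# Hypothetical sample containing two numbers.
text = "I bought 3 apples and 12 oranges"
out = replace_numbers(text, repl="NUM")
print(out)
```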
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica in the Netherlands as a successor to the ABC programming language, which was inspired by SETL, capable of exception handling
We can also pass a keyword argument named 'only' with a list of the bracket types we want removed. It supports three values: square, curly and round.
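The effect of the 'only' argument can be sketched as follows, assuming textacy 0.11 or later (preprocessing.remove.brackets), which removes the bracketed text together with the brackets themselves. The sample string is made up, and the except branch is a rough standard-library stand-in.

```python
import re

try:
    # Assumes textacy >= 0.11.
    from textacy import preprocessing
    remove_brackets = preprocessing.remove.brackets
except (ImportError, AttributeError):
    # Rough stand-in so the snippet runs without textacy.
    _PATTERNS = {
        "curly": r"\{[^{}]*?\}",
        "square": r"\[[^\[\]]*?\]",
        "round": r"\([^()]*?\)",
    }

    def remove_brackets(text, *, only=None):
        kinds = _PATTERNS if only is None else ([only] if isinstance(only, str) else only)
        for kind in kinds:
            text = re.sub(_PATTERNS[kind], "", text)
        return text

# Hypothetical sample with one bracket of each type.
text = "keep (round) and [square] and {curly} text"
no_brackets = remove_brackets(text)                # all three types removed
no_square = remove_brackets(text, only=["square"])  # only square brackets removed
print(no_brackets)
print(no_square)
```

Removing a bracketed span leaves the surrounding spaces in place, so a whitespace normalization pass afterwards is often useful.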
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica in the Netherlands
as a successor to the , which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}