In this article, we tokenize a sentence, a paragraph, and the contents of a webpage using the NLTK toolkit in Python. We then remove stop words and apply stemming to each of the three texts. Finally, we compute the frequency of the words that remain after stop word removal and stemming.
Modules Needed
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install this library, type the following command in IDE/terminal.
pip install bs4
urllib: The urllib package is Python's URL (Uniform Resource Locator) handling library. It is used to fetch URLs. It is part of the Python standard library, so no installation is needed.
nltk: The NLTK library is a large toolkit for Natural Language Processing in Python. It provides tokenizers, stop word lists, stemmers, and other common NLP building blocks. To install this library, type the following command in IDE/terminal.
pip install nltk
Stepwise Implementation:
Step1:
- Save the files sentence.txt and paragraph.txt in the current directory.
- Open the files using the open() function and store the file objects in file1 and file2.
- Read the file contents using the read() method, which returns the entire file as a single string.
- Display the file contents.
- Close the file objects.
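The file handling in Step 1 can be sketched as follows. The sample text written to sentence.txt and paragraph.txt is hypothetical and is only there so that the snippet runs on its own; in practice the files would already exist.

```python
# Hypothetical sample contents so this sketch is self-contained.
# Skip these two writes if the files already exist in the current directory.
with open("sentence.txt", "w", encoding="utf-8") as f:
    f.write("NLTK makes tokenizing text simple.")
with open("paragraph.txt", "w", encoding="utf-8") as f:
    f.write("Tokenizing splits text into words. Stemming reduces words to stems.")

# Open both files and keep the file objects in file1 and file2.
file1 = open("sentence.txt", encoding="utf-8")
file2 = open("paragraph.txt", encoding="utf-8")

# read() returns the whole file as one string.
sentence = file1.read()
paragraph = file2.read()

# Display the contents.
print(sentence)
print(paragraph)

# Close the file objects once the contents are in memory.
file1.close()
file2.close()
```

A `with open(...) as f:` block is a common alternative because it closes the file automatically, even if reading raises an error.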
Step2:
- Import urllib.request for opening and reading the webpage contents.
- From bs4, import BeautifulSoup, which lets us pull data out of HTML documents.
- Using urllib.request.urlopen(), make a request to the server at the given URL.
- The server responds and returns the HTML document.
- Read the contents of the webpage using the read() method.
- Pass the webpage data into BeautifulSoup, which parses the messy web data, repairs bad HTML, and presents it as an easily traversable structure.
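A minimal sketch of Step 2 is shown below. The helper name fetch_html and the inline sample document are assumptions made for illustration; the sample stands in for a real response so the snippet runs without network access. The "html.parser" argument selects the parser that ships with Python.

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch_html(url):
    """Request `url` and return the raw bytes of the response body.

    Hypothetical helper for this sketch; it is defined but not called below.
    """
    with urllib.request.urlopen(url) as response:
        return response.read()


# Stand-in for the bytes that fetch_html(some_url) would return.
web_page = (
    b"<html><head><title>Sample page</title></head>"
    b"<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

# Parse the raw HTML into a traversable tree.
soup = BeautifulSoup(web_page, "html.parser")

print(soup.title.string)
print(soup.prettify())
```

To work with a live page, replace the sample bytes with a call such as `fetch_html(url)` for a URL of your choice.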
Step3:
- To simplify the task of tokenizing, we extract only a portion of the HTML page.
- Using the BeautifulSoup object, extract all the paragraph tags present in the HTML document.
- soup('p') returns a list of all the paragraph tags present on the webpage.
- Create an empty string named web_page_data.
- For each tag in the list, append the text enclosed by the tag to web_page_data.
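The paragraph extraction in Step 3 might look like this. The inline HTML is a hypothetical sample, and the single space added between paragraphs is an assumption of this sketch so that words from adjacent paragraphs do not run together.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document used in place of a downloaded page.
html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p><div>Not a paragraph.</div>"
    "<p>Second paragraph.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all("p").
paragraph_tags = soup("p")

# Concatenate the text of every <p> tag into one string.
web_page_data = ""
for tag in paragraph_tags:
    web_page_data += tag.get_text() + " "
web_page_data = web_page_data.strip()

print(web_page_data)
```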
Step4:
- Using re.sub(), remove punctuation and other unwanted characters by replacing them with an empty string.
- re.sub() takes a regular expression, a replacement string, and the input string as arguments, and returns a modified copy of the input in which every match of the pattern is replaced by the replacement string.
- ^ - when placed at the start of a character class [...], it negates the class, so the class matches any character not listed after it.
- \w - matches any word character (letters, digits, and the underscore), and \s - matches any whitespace character. A pattern such as [^\w\s] therefore matches every character that is neither a word character nor whitespace, such as "!" and "?".
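The following sketch illustrates Step 4 on a hypothetical input string. It assumes the pattern [^\w\s], which is built from the three pieces described above. A stricter letters-only variant is included for comparison, since \w also keeps digits and underscores.

```python
import re

# Hypothetical input used only for illustration.
text = "Hello, world! It's 2024: NLTK_rocks?"

# Remove every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", text)
print(cleaned)  # Hello world Its 2024 NLTK_rocks

# Stricter variant: keep only ASCII letters and whitespace.
letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
print(letters_only)
```

Note that the first pattern keeps "2024" and the underscore, because both count as word characters.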
Step5:
- Pass the sentence, paragraph, and webpage contents, with punctuation and unnecessary characters removed, into word_tokenize(), which returns the tokenized sentence, tokenized paragraph, and tokenized web string.
- Display the tokenized sentence, tokenized paragraph, and tokenized web string.
Step6:
- from nltk.corpus import stopwords.
- Download stopwords using nltk.download('stopwords').
- Store the English stop words in nltk_stop_words.
- Compare each word in the tokenized sentence, tokenized paragraph, and tokenized web string with the words in nltk_stop_words. Any word that also occurs in nltk_stop_words is ignored.
Step7:
- from nltk.stem.porter import PorterStemmer.
- Perform stemming using NLTK: strip the suffix from each word and keep the root (stem).
- Create three empty lists for storing the stemmed words of the sentence, paragraph, and webpage.
- Using stemmer.stem(), stem each word in the stop-word-filtered lists and store the result in the newly created lists.
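Step 7 can be sketched for the sentence as follows; the paragraph and webpage lists are handled the same way. The input words are hypothetical. The Porter stemmer needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical list of words left after stop word removal.
sentence_without_stop_words = ["running", "connected", "studies"]

# Stem each word and collect the results in a new list.
sentence_after_stemming = []
for word in sentence_without_stop_words:
    sentence_after_stemming.append(stemmer.stem(word))

print(sentence_after_stemming)  # ['run', 'connect', 'studi']
```

The stem "studi" is not an English word, which is the kind of output the next step tries to repair.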
Step8:
- Stemming can produce strings that are not dictionary words, because the stemmer strips suffixes by rule rather than looking words up.
- Using the TextBlob module, we can find the closest correctly spelled word for such a stem.
- For each word in sentence_after_stemming, paragraph_after_stemming, and webpage_after_stemming, find the corrected word using the correct() method.
- Check whether the corrected word is present in the stop words. If it is not, replace the misspelled word with the corrected word.
Step9:
- Using the Counter class in the collections module, find the frequency of words in the sentence, paragraph, and webpage. A Counter is a container that holds the count of each element passed to it.
- Counter returns a dictionary subclass whose key-value pairs have the form {'word': word_count}.
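The frequency count in Step 9 can be sketched with a hypothetical list of stemmed words:

```python
from collections import Counter

# Hypothetical stemmed words for illustration.
sentence_after_stemming = ["token", "split", "text", "token", "word", "text", "token"]

# Count how often each word occurs.
word_counts = Counter(sentence_after_stemming)

print(word_counts["token"])        # 3
print(word_counts.most_common(2))  # [('token', 3), ('text', 2)]
```

most_common(n) returns the n most frequent words with their counts, which is convenient for a quick summary of each text.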
Below is the full implementation: