Tokenization is a fundamental task in Natural Language Processing (NLP): it breaks text down into smaller units such as words or sentences, which are then used in tasks like text classification, sentiment analysis and named entity recognition. TextBlob is a Python library for processing textual data that simplifies many NLP tasks, including tokenization. In this article we'll explore how to tokenize text using the TextBlob library in Python.
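TextBlob is a simple NLP library built on top of NLTK (Natural Language Toolkit) and Pattern. It provides easy-to-use APIs for common NLP tasks such as tokenization, part-of-speech tagging, noun phrase extraction and, in older releases, translation, among others. It offers two main types of tokenization: word tokenization, which splits text into individual words, and sentence tokenization, which splits text into sentences.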
Before starting, we need to install TextBlob. You can install it with the following command in a command-line interface (CLI):
    pip install textblob

Once installed, you also need to download the NLTK corpora that TextBlob uses for operations such as tokenization. TextBlob ships a downloader for this, which you can run from the same command line:

    python -m textblob.download_corpora
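A minimal sketch of the download step done from Python, assuming only NLTK's tokenizer models are needed for the tokenization examples in this article. The resource names are an assumption that depends on the installed NLTK version: older releases look for `punkt`, newer ones for `punkt_tab`, so the sketch simply requests both.

```python
import nltk

# Tokenizer model names: "punkt" for older NLTK releases,
# "punkt_tab" for newer ones (assumption: one of the two applies).
resources = ("punkt", "punkt_tab")

# nltk.download reports success or failure per resource;
# quiet=True suppresses the progress messages.
status = {name: nltk.download(name, quiet=True) for name in resources}
print(status)
```

A failed entry for one of the two names is expected when the installed NLTK version does not know that resource.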
Let's start by tokenizing text into words. We will use the TextBlob class to create a TextBlob object, which allows us to easily manipulate the text.
First create a TextBlob object with a sample text. The words property of the TextBlob object then returns a list of the words in the text, breaking the sentence into individual tokens.
Output:
['Hello', 'I', 'am', 'learning', 'NLP', 'with', 'TextBlob']
Now we will tokenize text into sentences. To do this you can use the sentences property of the TextBlob object.
Here the sentences property breaks the text into three individual sentences.
Output:
Hello!
I am learning NLP with TextBlob.
It's a fun journey.
Once you've tokenized the text into words or sentences, you can perform further processing on the tokens. Common next steps include filtering out stop words and analyzing word frequencies.
For example, downloading a list of stop words from NLTK's stopwords corpus and filtering them out of the tokenized word list above gives:
['Hello', 'learning', 'NLP', 'TextBlob']
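The filtering step can be sketched as below. To keep the example runnable without any downloads, it uses a small hand-written stop word set as a stand-in for NLTK's stopwords corpus; `STOP_WORDS` and `remove_stop_words` are hypothetical names introduced for this illustration only.

```python
# Tokens as produced by the word tokenization step.
words = ["Hello", "I", "am", "learning", "NLP", "with", "TextBlob"]

# Hand-written stand-in for a real stop word list (assumption for this sketch).
STOP_WORDS = {"i", "am", "with", "a", "an", "the", "is", "are", "and", "or", "to", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words(words)
print(filtered)  # ['Hello', 'learning', 'NLP', 'TextBlob']
```

To use NLTK's list instead, download it once with `nltk.download("stopwords")` and pass `set(stopwords.words("english"))` from `nltk.corpus` as the `stop_words` argument. Comparing in lowercase matters because the corpus entries are lowercase while tokens such as "I" are not.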
Tokenization is an important step in NLP, and TextBlob simplifies this process in Python. With TextBlob you can easily tokenize text into words and sentences and then perform further operations such as filtering stop words and analyzing word frequencies.