Tokenization is a fundamental step in Natural Language Processing (NLP). It divides a textual input into smaller units known as tokens, which can be words, characters, sub-words or sentences. Working with these units makes text easier for models to interpret. Let's understand how tokenization works.
Natural Language Processing (NLP) is a subfield of Artificial Intelligence, information engineering and human-computer interaction. It focuses on how to process and analyze large amounts of natural language data efficiently. This is difficult because reading and understanding language is far more complex than it seems at first glance.
Tokenization can be classified into several types based on how the text is segmented. Here are some types of tokenization:
Word tokenization is the most commonly used method, in which text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:
Input before tokenization: ["Machine learning is fascinating"]
Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
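As a rough illustration of the idea, word tokenization can be sketched with Python's standard `re` module. This is a minimal sketch, not NLTK's implementation: the function name `simple_word_tokenize` and the pattern are assumptions chosen for the example, and punctuation marks are emitted as separate tokens.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks (illustrative only)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Machine learning is fascinating"))
# ['Machine', 'learning', 'is', 'fascinating']
```

A pattern like this relies on whitespace and punctuation to mark word boundaries, which is why it suits English better than languages written without spaces.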
In character tokenization, the text is split into a sequence of individual characters. This is beneficial for tasks that require fine-grained analysis, such as spelling correction, or for text in which word boundaries are unclear. It is also useful for character-level language modelling.
Example
Input before tokenization: ["You are helpful"]
Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
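In Python, a string is already a sequence of characters, so a character tokenizer can be sketched in one line. The function name `char_tokenize` is hypothetical; note that spaces are kept as tokens, as in the example above.

```python
def char_tokenize(text):
    """Return every character of the text, including spaces, as its own token."""
    return list(text)

print(char_tokenize("You are helpful"))
# ['Y', 'o', 'u', ' ', 'a', 'r', 'e', ' ', 'h', 'e', 'l', 'p', 'f', 'u', 'l']
```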
Sub-word tokenization strikes a balance between word and character tokenization by breaking text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.
Example
"Timetable" becomes ["Time", "table"]
"Raincoat" becomes ["Rain", "coat"]
"Gracefully" becomes ["Grace", "fully"]
"Runway" becomes ["Run", "way"]
Sub-word tokenization helps handle out-of-vocabulary words in NLP tasks, and it suits languages that form words by combining smaller units.
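One simple way to picture sub-word splitting is greedy longest-match against a fixed vocabulary of pieces. The sketch below is an assumption-laden toy: the vocabulary `VOCAB` is made up from the examples above, `subword_tokenize` is a hypothetical name, and real sub-word tokenizers learn their vocabulary from data rather than using a hand-written set.

```python
# Toy vocabulary built from the examples above (an assumption for illustration).
VOCAB = {"Time", "table", "Rain", "coat", "Grace", "fully", "Run", "way"}

def subword_tokenize(word, vocab=VOCAB):
    """Greedily take the longest vocabulary piece at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokenize("Timetable"))   # ['Time', 'table']
print(subword_tokenize("Gracefully"))  # ['Grace', 'fully']
```

The single-character fallback is what lets such a scheme cope with out-of-vocabulary input: an unknown word is still split into pieces instead of being rejected.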
Sentence tokenization is another common technique. It divides a paragraph or a larger body of text into separate sentences, each of which becomes a token. This is useful for tasks that analyze or process sentences individually.
Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images."]
Output when tokenized by sentences: ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]
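A naive version of sentence splitting can be sketched with a regular expression that breaks on whitespace following `.`, `!` or `?`. This is only a rule-of-thumb sketch with a hypothetical function name `split_sentences`; it is not how a trained sentence tokenizer works, and it mis-splits abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation (naive rule)."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = ("Artificial Intelligence is an emerging technology. "
        "Machine learning is fascinating. Computer Vision handles images.")
for sentence in split_sentences(text):
    print(sentence)
```

The abbreviation problem is the main reason data-driven sentence tokenizers exist: a fixed rule cannot tell a sentence-final period from one inside "Dr." or "e.g.".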
N-gram tokenization splits text into overlapping sequences of n consecutive tokens, where n is a fixed size (n = 2 gives bigrams, n = 3 gives trigrams).
Input before tokenization: ["Machine learning is powerful"]
Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
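The sliding-window idea behind n-grams can be sketched in plain Python. The helper name `ngrams` is chosen for this example, and the input is assumed to be already split into words.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so it yields len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "Machine learning is powerful".split()
print(ngrams(words, 2))
# [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
```

Because consecutive windows overlap, a sequence of k tokens produces k - n + 1 n-grams, which is three bigrams for the four words here.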
Tokenization is an essential step in text processing and natural language processing (NLP) for several reasons. Some of these are listed below:
The code snippet uses the sent_tokenize function from the NLTK library, which segments a given text into a list of sentences.
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
How sent_tokenize works: The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This tokenizer has already been trained, so it knows which characters and punctuation marks signal the beginning and end of a sentence.
The PunktSentenceTokenizer class from the NLTK library can also be used directly, which is efficient when tokenizing many texts. Punkt is a data-driven sentence tokenizer that comes with NLTK. It is trained on a large corpus of text to identify sentence boundaries.
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Sentences in languages other than English can also be tokenized by loading the pickle file trained for that language.
Output:
['Hola amigo.',
'Estoy bien.']
The code snippet uses the word_tokenize function from the NLTK library to tokenize a given text into individual words.
Output:
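['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']
How word_tokenize works: word_tokenize() is a wrapper function that first splits the text into sentences and then calls tokenize() on a Treebank-style word tokenizer (TreebankWordTokenizer, or an improved variant of it in recent NLTK releases) for each sentence.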
The code snippet uses the TreebankWordTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text into individual words.
Output:
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']
These tokenizers work by separating words using punctuation and spaces. As the code outputs above show, they do not discard punctuation, so the user can decide what to do with it during pre-processing. Note that TreebankWordTokenizer assumes its input is a single sentence and splits off only the final period, which is why 'everyone.' remains one token here while word_tokenize separates it.
The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries. Each punctuation mark is treated as a separate token.
Output:
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
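NLTK documents WordPunctTokenizer as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The standard-library sketch below applies that same pattern with `re.findall`; it is an equivalent illustration rather than NLTK's own code, and the function name `wordpunct_like` is hypothetical.

```python
import re

def wordpunct_like(text):
    """Split into runs of word characters and runs of punctuation."""
    # \w+      : a run of letters, digits or underscores
    # [^\w\s]+ : a run of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_like("Let's see how it's working."))
# ['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
```

Seen this way, the behaviour on contractions follows directly from the pattern: the apostrophe is not a word character, so it always ends the current word and becomes its own token.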
The code snippet uses the RegexpTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text based on a regular expression pattern.
Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Using regular expressions allows for more fine-grained control over tokenization, and you can customize the pattern based on your specific requirements.
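To show how much the pattern matters, the sketch below compares two patterns using Python's `re` module. Both patterns are assumptions made for illustration: the first reproduces the output shown above, while the second keeps contractions such as "Let's" together.

```python
import re

text = "Let's see how it's working."

# Pattern 1: keep only runs of word characters, so punctuation is dropped.
words_only = re.findall(r"\w+", text)

# Pattern 2: additionally allow one apostrophe followed by more word characters.
with_contractions = re.findall(r"\w+(?:'\w+)?", text)

print(words_only)         # ['Let', 's', 'see', 'how', 'it', 's', 'working']
print(with_contractions)  # ["Let's", 'see', 'how', "it's", 'working']
```

Whether contractions should stay whole depends on the downstream task, which is the kind of decision a custom pattern lets you make explicitly.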
We have discussed how to perform tokenization using the NLTK library. Tokenization can also be implemented using the following methods and libraries: