Tokenization is a fundamental Natural Language Processing (NLP) technique: the process of dividing text into smaller components, or tokens. Tokens are typically sentences, words, or smaller pieces such as punctuation marks.
With Python’s popular library NLTK (Natural Language Toolkit), splitting text into meaningful units becomes both simple and extremely effective.
Let's see the implementation of tokenization using NLTK in Python.
Download the “punkt” tokenizer models, which sentence and word tokenization depend on.
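A minimal sketch of the download step, assuming NLTK itself is already installed (for example with `pip install nltk`). The extra "punkt_tab" resource is an assumption added here because newer NLTK releases look for it instead of "punkt"; requesting both covers either case.

```python
import nltk

# "punkt" is the resource named in the text; "punkt_tab" is an assumption
# added for newer NLTK releases. nltk.download() returns True on success.
results = {name: nltk.download(name, quiet=True) for name in ("punkt", "punkt_tab")}
print(results)
```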
sent_tokenize() splits a string into a list of sentences, handling punctuation and abbreviations.
Output:
['NLTK is a great NLP toolkit.', 'It makes processing text easy!']
Output:
['Tokenization', 'is', 'easy', 'with', 'NLTK', "'s", 'word_tokenize', '.']
Let's see some more examples.
WordPunctTokenizer splits text into runs of alphanumeric characters and runs of punctuation, so contractions and e-mail addresses are broken apart:
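A sketch using NLTK's WordPunctTokenizer, which needs no downloaded models because it is purely pattern-based. The input string is reconstructed from the output shown below (an assumption).

```python
from nltk.tokenize import WordPunctTokenizer

# Input reconstructed from the output shown below (assumption).
text = "Don't split contractions. E-mails: hello@example.com!"

# Every run of word characters and every run of punctuation becomes a token.
tokens = WordPunctTokenizer().tokenize(text)
print(tokens)
```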
Output:
['Don', "'", 't', 'split', 'contractions', '.', 'E', '-', 'mails', ':', 'hello', '@', 'example', '.', 'com', '!']
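TreebankWordTokenizer follows the Penn Treebank conventions, which makes it suitable for linguistic analysis. It separates punctuation from words and splits contractions and clitics such as 's into their own tokens.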
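A sketch with NLTK's TreebankWordTokenizer, which is rule-based and needs no downloaded models. The input sentence is reconstructed from the output shown below (an assumption).

```python
from nltk.tokenize import TreebankWordTokenizer

# Input reconstructed from the output shown below (assumption).
text = "Have a look at NLTK's tokenizers."

# The possessive 's and the final period are split off as separate tokens.
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)
```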
Output:
['Have', 'a', 'look', 'at', 'NLTK', "'s", 'tokenizers', '.']
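RegexpTokenizer tokenizes text with a custom regular-expression pattern, for example keeping only words and numbers and dropping punctuation.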
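A sketch with NLTK's RegexpTokenizer. Both the pattern `\w+` and the input string are assumptions chosen to reproduce the output shown below; the original example may have used different punctuation.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical input: only the resulting tokens are known from the output below.
text = "Custom rule: keep only words, numbers; drop punctuation!"

# r"\w+" matches runs of letters, digits and underscores, so punctuation
# never becomes part of a token.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text)
print(tokens)
```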
Output:
['Custom', 'rule', 'keep', 'only', 'words', 'numbers', 'drop', 'punctuation']
NLTK provides a useful and user-friendly toolkit for tokenizing text in Python, supporting a range of tokenization needs from basic word and sentence splitting to advanced custom patterns.