Tokenization is a preprocessing step in NLP in which text is split into smaller units called tokens, such as words, punctuation marks, or special characters. Working with tokens instead of one raw string makes text easier for machines to process and analyze.
For example, the sentence "I love natural language processing!" looks like this after tokenization: ["I", "love", "natural", "language", "processing", "!"]
Features of SpaCy Tokenizer
Performs fast and efficient tokenization on large text datasets
Treats punctuation marks as separate tokens for accurate text processing
Supports language-specific tokenization rules and grammar patterns
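Keeps URLs and email addresses intact as single tokens and preserves spaces and newlines so the original text can be recovered (hashtags may need a custom rule to stay in one piece)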
Allows customization through user-defined tokenization rules
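Some of the features above can be checked with a short sketch. The sample sentence, the example.com addresses, and the "gimme" rule are illustrative assumptions, not part of the original walkthrough. A blank English pipeline is used so that no model download is needed.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Illustrative sentence containing a URL, an email address, and punctuation.
text = "Visit https://example.com or email hello@example.com today!"
doc = nlp(text)
default_tokens = [token.text for token in doc]
print(default_tokens)

# Lexical flags that spaCy sets on each token.
urls = [token.text for token in doc if token.like_url]
emails = [token.text for token in doc if token.like_email]
print(urls, emails)

# User-defined rule: split "gimme" into two tokens.
before = [token.text for token in nlp("gimme that")]
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]
print(before)  # ['gimme', 'that']
print(after)   # ['gim', 'me', 'that']
```

The special-case rule only applies to the exact string it was registered for, so it is best suited to a small set of known exceptions.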
Implementation
Here, we’ll see how to implement tokenization using SpaCy.
1. Blank Model Tokenization
In this method, spacy.blank("en") creates a blank English pipeline that contains only the tokenizer, so text is split into tokens without any pre-trained NLP components such as a POS tagger or a named entity recognizer.
Initializes a minimal NLP pipeline
Performs basic text tokenization
Does not include pre-trained language processing features
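The points above can be sketched as follows. This is a minimal example that reuses the sentence from the introduction; the variable names are arbitrary.

```python
import spacy

# Create a blank English pipeline (tokenizer only, nothing to download).
nlp = spacy.blank("en")

# Run the pipeline on the text; the result is a Doc made of Token objects.
doc = nlp("I love natural language processing!")

tokens = [token.text for token in doc]
print(tokens)          # ['I', 'love', 'natural', 'language', 'processing', '!']

# A blank pipeline has no components after the tokenizer.
print(nlp.pipe_names)  # []
```

Because no tagger or parser runs, attributes such as token.pos_ stay empty with this pipeline.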
The second approach loads the pre-trained en_core_web_sm model, which adds SpaCy’s trained NLP components on top of the tokenizer. Listing the pipeline components then shows which processing modules are available.
Loads the pre-trained SpaCy model
Displays available NLP pipeline components
Includes components for tasks like tagging and parsing
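A minimal sketch of this step is shown below. It assumes the model has already been installed (for example with python -m spacy download en_core_web_sm); the fallback message is an addition for illustration. The exact list of components depends on the installed SpaCy and model versions.

```python
import spacy

try:
    # Load the small pre-trained English pipeline.
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Raised when the model package is not installed.
    nlp = None
    print("Model not found. Install it with: python -m spacy download en_core_web_sm")

if nlp is not None:
    # Names of the components that run after the tokenizer.
    print(nlp.pipe_names)

    # Tokens now carry predictions from the trained components.
    doc = nlp("I love natural language processing!")
    for token in doc:
        print(token.text, token.pos_, token.dep_)
```

Tokenization itself is the same as with the blank model; the difference is that each token now also has a part-of-speech tag and a dependency label.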