Tokenization Using spaCy

Last Updated : 25 May, 2026

Tokenization is a preprocessing step in NLP in which text is divided into smaller units called tokens, such as words, punctuation marks or special characters. This makes text easier for machines to process and analyze.

Let's see an example:

Sentence: “I love natural language processing!”
After tokenization: ["I", "love", "natural", "language", "processing", "!"]

Features of the spaCy Tokenizer

Performs fast and efficient tokenization on large text datasets
Treats punctuation marks as separate tokens for accurate text processing
Supports language-specific tokenization rules and grammar patterns
Handles spaces, newlines, URLs and email addresses with built-in rules; other patterns such as hashtags can be adjusted with custom rules
Allows customization through user-defined tokenization rules
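
Two of the features above can be illustrated with a short sketch. It assumes spaCy is installed and uses a blank English pipeline; the sample sentences, the example.com URL and the made-up word "gimme" are illustrative choices, not part of the original article.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline: rule-based tokenizer only, no trained components.
nlp = spacy.blank("en")

# Punctuation is split off as its own token, while the URL stays in one piece.
doc = nlp("Read the docs at https://example.com/docs today!")
tokens = [token.text for token in doc]
print(tokens)

# User-defined rule (illustrative): always split "gimme" into "gim" + "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
custom_tokens = [token.text for token in nlp("gimme that")]
print(custom_tokens)
```

Special cases like this apply only to exact string matches, so they suit fixed words or abbreviations rather than general patterns.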

Implementation

Here, we’ll see how to implement tokenization using spaCy.

1. Blank Model Tokenization

In this method, spaCy’s blank English pipeline, created with spacy.blank("en"), performs rule-based tokenization without pre-trained NLP components such as a POS tagger or a named entity recognizer.

Initializes a minimal NLP pipeline
Performs basic text tokenization
Does not include pre-trained language processing features
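
A minimal sketch of this step, assuming spaCy is installed, is shown here. It reuses the example sentence from the introduction; the variable names are illustrative.

```python
import spacy

# Create a blank English pipeline: it contains only the rule-based tokenizer.
nlp = spacy.blank("en")

text = "I love natural language processing!"
doc = nlp(text)

# Each Token exposes its original text through the .text attribute.
tokens = [token.text for token in doc]
print(tokens)
# ['I', 'love', 'natural', 'language', 'processing', '!']

# No trained components are present, so the list of pipeline names is empty.
print(nlp.pipe_names)
# []
```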

2. Displaying the Pipeline Components

In this step, the pre-trained en_core_web_sm model is loaded to access spaCy’s NLP components. The names of the pipeline components can then be listed to see which processing modules are available.

Loads the pre-trained spaCy model
Displays available NLP pipeline components
Includes components for tasks like tagging and parsing
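
One way to sketch this step is shown here. It assumes the model has been installed beforehand with python -m spacy download en_core_web_sm; the fallback to a blank pipeline is an addition for this sketch only, so that the script still runs when the model is missing.

```python
import spacy

MODEL_NAME = "en_core_web_sm"

try:
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # spacy.load raises OSError when the model package is not installed.
    print(f"{MODEL_NAME} not found; install it with: python -m spacy download {MODEL_NAME}")
    nlp = spacy.blank("en")  # fallback for this sketch: tokenizer only

# pipe_names lists the component names in the order they run on a text.
component_names = nlp.pipe_names
print(component_names)
```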

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

3. Tokenization with Part-of-Speech Tagging and Lemmatization

In this step, text is processed to perform tokenization, identify part-of-speech tags and extract the lemmatized form of each token.

Splits text into tokens
Assigns grammatical POS tags
Extracts root or dictionary forms of words
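
The step can be sketched as follows, again assuming en_core_web_sm is installed. The sample sentence is an illustrative choice, and the fallback to a blank pipeline is an addition for this sketch; with the fallback, the POS and lemma columns are empty because no trained components are present.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: fall back to tokenization only (sketch assumption).
    print("Install the model with: python -m spacy download en_core_web_sm")
    nlp = spacy.blank("en")

doc = nlp("The cats are running quickly.")

# Collect the token text, coarse-grained POS tag and lemma for each token.
rows = [(token.text, token.pos_, token.lemma_) for token in doc]

for text, pos, lemma in rows:
    print(f"{text:<10}{pos:<8}{lemma}")
```

Here token.pos_ holds the coarse-grained part-of-speech tag and token.lemma_ holds the base form of the word.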

Advantages

Provides fast and efficient tokenization for large text datasets
Offers accurate rule-based tokenization that feeds directly into pre-trained NLP models
Simple and user-friendly API for easy implementation
Supports integration with tasks like lemmatization, parsing and NER
Useful for building custom NLP pipelines and applications
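
For larger collections of text, a sketch like the following streams documents through the pipeline in batches with nlp.pipe instead of calling nlp on each text separately. The sample texts and the batch size are illustrative assumptions.

```python
import spacy

# A blank pipeline is enough when only tokens are needed.
nlp = spacy.blank("en")

texts = [
    "Tokenization splits text into tokens.",
    "spaCy can stream many documents.",
    "Each text becomes a Doc object.",
]

# nlp.pipe yields one Doc per input text, processing the texts in batches.
token_counts = [len(doc) for doc in nlp.pipe(texts, batch_size=2)]
print(token_counts)
```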

Limitations

Pre-trained models, especially the larger ones, can require substantial memory and may not suit memory-constrained environments
Domain-specific text may need additional fine-tuning
Performance can vary for languages with limited linguistic resources

URL: https://www.geeksforgeeks.org/nlp/tokenization-using-spacy-library/