SparkNLP: A Comprehensive Guide to NLP Library

Last Updated : 23 Jul, 2025

SparkNLP is an open-source library for a wide range of Natural Language Processing (NLP) tasks, built on top of Apache Spark and usable from Python. It offers high-performance annotators such as StopWordsCleaner, Tokenizer, and Chunker, along with pre-trained pipelines and models. By combining the distributed computing power of Spark with state-of-the-art NLP algorithms, SparkNLP is suitable for both small projects and enterprise-level applications.

In this article, we will explore the functionalities of SparkNLP.

Installation of SparkNLP

Via PyPI:

To install the library with pip, you also need pyspark, one of its dependencies, if it is not already installed.

Below is the full command to install both:

pip install spark-nlp pyspark

Via Google Colab Kernel:

If you're working in a Google Colab notebook, a setup script handles the installation for you, so no manual configuration is needed.

Simply run the following code in your Colab notebook, and you can start using Spark NLP right away:

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh -O - | bash

Functionalities of SparkNLP

1. Named Entity Recognition (NER)

Named Entity Recognition (NER) is a fundamental NLP task in which we identify and classify entities within a text. These entities can include names of people, organizations, and locations, as well as numerical expressions such as money, percentages, and time. SparkNLP offers a pre-trained pipeline for NER called `recognize_entities_dl`, which can be loaded and applied to text in a few lines.

To implement NER using SparkNLP, we will perform the following steps:

Import Necessary Libraries: Import sparknlp and the PretrainedPipeline class from sparknlp.pretrained.
Start a Spark Session:
- Start a Spark session using the sparknlp.start() function.
- Note: The initial startup might take more than a minute.
Create a Pre-trained Pipeline: Create a pre-trained pipeline by passing the name recognize_entities_dl to PretrainedPipeline().
Define Sample Text: Define a sample text that you want to analyze using the pipeline.
Annotate the Text:
- Pass the sample text into the pipeline.annotate() function.
- This function returns a dictionary of annotations keyed by output column, such as 'document', 'sentence', 'token', 'embeddings', 'ner', and 'entities'.
Access Recognized Entities: To print the recognized entities, access the entities key from the result dictionary.
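
The steps above can be sketched as follows. This is a minimal sketch, not the article's original listing: the sample sentence is an assumption reconstructed from the token output shown in the Tokenization section, and the helper names `extract_entities` and `recognize_entities` are hypothetical. The Spark NLP imports sit inside the function so the module can be loaded without starting Spark.

```python
# Sample sentence reconstructed from the token output in this article;
# treat it as an assumption rather than the author's exact input.
SAMPLE_TEXT = (
    "Adil Naib, is one of the authors of GeeksForGeeks, has published many "
    "articles on topics like Data Science and Machine Learning."
)


def extract_entities(result):
    """Return the entity strings from an annotate() result dictionary."""
    return list(result.get("entities", []))


def recognize_entities(text):
    """Run the recognize_entities_dl pretrained pipeline on one string."""
    # Imported here so this module loads even where Spark NLP is absent.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # starts (or reuses) a Spark session; may take a minute
    pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
    result = pipeline.annotate(text)  # dict: output column -> list of strings
    return extract_entities(result)


# Example call (needs Java, pyspark, spark-nlp, and network access for the
# model download):
# print(recognize_entities(SAMPLE_TEXT))
```

The first call downloads the pipeline, which produces the download messages shown in the output.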

Output:

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]

['Adil Naib', 'GeeksForGeeks', 'Data Science and Machine Learning']

2. Stop Words Removal

Identifying and removing stop words is an important text preprocessing step when building language models, because words such as "the", "is", and "of" contribute little to the meaning of a sentence. Removing them reduces noise in the text data and can improve model performance. SparkNLP provides the ‘StopWordsCleaner’ annotator to remove stop words from text.

To implement stop word removal, we will follow these steps:

Import Necessary Classes: Import the necessary classes from sparknlp.base and sparknlp.annotator.
Create a Sample DataFrame: Use the spark.createDataFrame() function to create a DataFrame containing the text data you want to process.
Set Up the DocumentAssembler:
- Initialize the DocumentAssembler, which will convert the text data into a format that Spark NLP can process.
- Set the input column to your text data column using setInputCol("text").
- Set the output column to "document" using setOutputCol("document").
Set Up the Tokenizer:
- Initialize the Tokenizer, which will split the document into individual tokens.
- Set the input columns to "document" using setInputCols(["document"]).
- Set the output column to "token" using setOutputCol("token").
Set Up the StopWordsCleaner:
- Initialize the StopWordsCleaner, which will remove common stopwords from the tokens.
- Set the input columns to "token" using setInputCols(["token"]).
- Set the output column to "cleanTokens" using setOutputCol("cleanTokens").
- Optionally, call setCaseSensitive(False) to ignore case when matching stopwords.
Create a Pipeline: Create a Pipeline with the stages [document_assembler, tokenizer, stopwords_cleaner].
Fit the Model to the Data: Fit the pipeline model to the data using the fit() function.
Transform the Data: Use the transform() function to apply the pipeline to your data and get the cleaned tokens.
Display the Cleaned Tokens:
- Use the select() function to select the "cleanTokens.result" column.
- Use the show(truncate=False) function to display the cleaned tokens without truncating the output.
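
A minimal sketch of these steps is shown below. It is not the article's original listing: the function name `remove_stop_words` is hypothetical, and the sketch assumes a Spark session created with `sparknlp.start()` is passed in. `StopWordsCleaner` is used with its default English stop word list.

```python
def remove_stop_words(spark, text):
    """Return a DataFrame whose single column holds the tokens left after
    stop word removal. `spark` is a session from sparknlp.start()."""
    # Imported here so this module loads even where Spark NLP is absent.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, StopWordsCleaner

    # One-row DataFrame with a single "text" column.
    data = spark.createDataFrame([[text]]).toDF("text")

    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    stopwords_cleaner = (
        StopWordsCleaner()
        .setInputCols(["token"])
        .setOutputCol("cleanTokens")
        .setCaseSensitive(False)  # match stop words regardless of case
    )

    pipeline = Pipeline(
        stages=[document_assembler, tokenizer, stopwords_cleaner]
    )
    model = pipeline.fit(data)
    return model.transform(data).select("cleanTokens.result")


# Example call (needs Java, pyspark, and spark-nlp installed):
# import sparknlp
# spark = sparknlp.start()
# remove_stop_words(spark, "Your sentence here.").show(truncate=False)
```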

Output:

[Adil, Naib, ,, one, authors, GeeksForGeeks, ,, published, many, articles, topics, like, Data, Science, Machine, Learning, .]

3. Tokenization

In tokenization, we break text down into individual units called tokens, typically words and punctuation marks. Effective tokenization is important for tasks such as text classification, sentiment analysis, and machine translation. SparkNLP provides the ‘Tokenizer’ annotator, which tokenizes the whole text.

We will implement tokenization using the following steps:

DocumentAssembler Setup: Converts raw text into a format that Spark NLP can process. Input: "text" column, Output: "document" column.
Tokenizer Setup:
- Splits the text into individual tokens (words).
- Input: "document" column, Output: "token" column.
Pipeline Creation: Combines DocumentAssembler and Tokenizer into a sequential process.
Fit Pipeline: Fits the pipeline to the input data, producing a pipeline model that is ready for transformation.
Transform Data: Applies the fitted pipeline to the data, generating tokens.
Display Tokens: Selects and shows the generated tokens from the "token.result" column without truncation.
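
The tokenization steps can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `tokenize_text` is hypothetical, and `spark` is assumed to be a session returned by `sparknlp.start()`.

```python
def tokenize_text(spark, text):
    """Return a DataFrame whose single column holds the tokens of `text`."""
    # Imported here so this module loads even where Spark NLP is absent.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    data = spark.createDataFrame([[text]]).toDF("text")

    # "text" column -> "document" annotations
    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    # "document" annotations -> "token" annotations
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    pipeline = Pipeline(stages=[document_assembler, tokenizer])
    model = pipeline.fit(data)
    # "token.result" keeps only the token strings, dropping metadata.
    return model.transform(data).select("token.result")


# Example call (needs Java, pyspark, and spark-nlp installed):
# import sparknlp
# spark = sparknlp.start()
# tokenize_text(spark, "Your sentence here.").show(truncate=False)
```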

Output:

[Adil, Naib, ,, is, one, of, the, authors, of, GeeksForGeeks, ,, has, published, many, articles, on, topics, like, Data, Science, and, Machine, Learning, .]

4. Chunking

Chunking, also known as shallow parsing, groups words into chunks such as noun phrases based on their part-of-speech tags and grammatical structure. These chunks give downstream tasks more context than isolated tokens, which is why chunking is commonly used in tasks such as text summarization and sentiment analysis. SparkNLP offers a `Chunker` annotator that enables us to chunk sentences effectively.

To implement chunking, we will follow these steps:

Set Up DocumentAssembler: Convert the text into a document format and use setInputCol() and setOutputCol() to specify input and output columns.
Set Up SentenceDetector: Identify sentences within the document and use setInputCols() and setOutputCol() to define input and output columns.
Set Up Tokenizer: Split sentences into individual tokens (words) and specify the input and output columns using setInputCols() and setOutputCol().
Set Up PerceptronModel: Tag each token with part-of-speech labels using the pre-trained PerceptronModel and set the input and output columns.
Set Up Chunker: Identify chunks of text based on specified regex patterns and define the input columns for sentences and POS tags, and set the output column.
Create Pipeline: Combine the stages (DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, and Chunker) into a pipeline.
Fit Pipeline to Data: Fit the pipeline to the provided data using the fit() function.
Transform Text: Apply the fitted pipeline to the text data to generate the chunks using the transform() function.
Display Chunks: Select and show the generated chunks using select() and show().
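
One way to wire these stages together is sketched below. The article does not state which regex patterns it uses, so the default `<NNP>+` (runs of proper nouns) is an assumption, as are the function name `chunk_text` and the output column names. `PerceptronModel.pretrained()` downloads the default English part-of-speech model on first use.

```python
def chunk_text(spark, text, patterns=("<NNP>+",)):
    """Return a DataFrame whose single column holds the chunks matched by
    the POS-tag `patterns`. `spark` is a session from sparknlp.start()."""
    # Imported here so this module loads even where Spark NLP is absent.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Chunker,
        PerceptronModel,
        SentenceDetector,
        Tokenizer,
    )

    data = spark.createDataFrame([[text]]).toDF("text")

    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    sentence_detector = (
        SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    )
    tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    # Pre-trained part-of-speech tagger; needs network access the first time.
    pos_tagger = (
        PerceptronModel.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("pos")
    )
    # Assumed pattern: one or more consecutive proper nouns form a chunk.
    chunker = (
        Chunker()
        .setInputCols(["sentence", "pos"])
        .setOutputCol("chunk")
        .setRegexParsers(list(patterns))
    )

    pipeline = Pipeline(
        stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            pos_tagger,
            chunker,
        ]
    )
    model = pipeline.fit(data)
    return model.transform(data).select("chunk.result")


# Example call (needs Java, pyspark, spark-nlp, and network access):
# import sparknlp
# spark = sparknlp.start()
# chunk_text(spark, "Your sentence here.").show(truncate=False)
```

Other tag patterns can be passed through `patterns` to capture different phrase types.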

Output:

[Adil Naib, GeeksForGeeks, Data Science, Machine Learning]


URL: https://www.geeksforgeeks.org/nlp/sparknlp-a-comprehensive-guide-to-nlp-library/