SparkNLP is a Python library for a wide range of Natural Language Processing (NLP) tasks, built on top of Apache Spark. It offers high-performance annotators such as StopWordsCleaner, Tokenizer, and Chunker for tasks like stop word removal, tokenization, and chunking. By combining the distributed computing power of Spark with state-of-the-art NLP algorithms, SparkNLP is suitable for both small projects and enterprise-level applications.
In this article, we will explore the functionalities of SparkNLP.
If you want to install this library using pip, you'll need to install one of its dependencies, pyspark, if it's not already installed.
Below is the full command to install both:
pip install spark-nlp pyspark
If you're working in a Google Colab notebook, there's an easy way to get started without any installation or setup.
Simply run the following code in your Colab notebook, and you can start using Spark NLP right away:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh -O - | bash
Named Entity Recognition (NER) is a fundamental task in NLP, where we aim to identify and classify entities within a text. These entities can include names of people, organizations, and locations, as well as numerical data such as money, percentages, and time. SparkNLP offers a pre-trained pipeline for NER called `recognize_entities_dl`, which can be loaded and applied to text in a few lines.
To implement NER using SparkNLP, we will perform the following steps:
1. Import sparknlp and the PretrainedPipeline class from sparknlp.pretrained.
2. Start a Spark session using the sparknlp.start() function.
3. Load the recognize_entities_dl pipeline by passing its name to PretrainedPipeline().
4. Annotate the text using the pipeline.annotate() function.
5. Extract the entities key from the result dictionary.
Output:
recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]
['Adil Naib', 'GeeksForGeeks', 'Data Science and Machine Learning']
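The NER steps above can be sketched roughly as follows. This is a minimal sketch, not the article's original code: `recognize_entities`, `entities_from_result`, and `SAMPLE_TEXT` are hypothetical names introduced for illustration, and the sample sentence is reconstructed from the tokenization output shown later in this article, so the exact input used for the NER run is an assumption.

```python
# Assumption: sentence reconstructed from the tokenizer output in this article.
SAMPLE_TEXT = (
    "Adil Naib, is one of the authors of GeeksForGeeks, has published many "
    "articles on topics like Data Science and Machine Learning."
)


def entities_from_result(result):
    """Return the list stored under the "entities" key of an annotate() result."""
    return list(result.get("entities", []))


def recognize_entities(text, pipeline_name="recognize_entities_dl"):
    """Run a pretrained Spark NLP pipeline on one string and return its entities."""
    # Imports live inside the function so nothing touches Spark until it is called.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # starts (or reuses) a Spark session with Spark NLP loaded
    pipeline = PretrainedPipeline(pipeline_name, lang="en")  # downloads on first use
    result = pipeline.annotate(text)  # dict: output column name -> list of strings
    return entities_from_result(result)
```

Calling `recognize_entities(SAMPLE_TEXT)` requires a working Spark installation and network access for the first download of the pipeline.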
Identifying and removing stop words is an important step in text preprocessing when building language models, because stop words contribute little to the meaning of a sentence. Removing them reduces noise in the text data and can improve the performance of language models. SparkNLP provides the `StopWordsCleaner` annotator to remove stop words from text.
To implement stop word removal, we will follow these steps:
1. Import the required classes from sparknlp.base and sparknlp.annotator.
2. Use the spark.createDataFrame() function to create a DataFrame containing the text data you want to process.
3. Create a DocumentAssembler, which will convert the text data into a format that Spark NLP can process.
   - Set the input column using setInputCol("text").
   - Set the output column to "document" using setOutputCol("document").
4. Create a Tokenizer, which will split the document into individual tokens.
   - Set the input column to "document" using setInputCols(["document"]).
   - Set the output column to "token" using setOutputCol("token").
5. Create a StopWordsCleaner, which will remove common stopwords from the tokens.
   - Set the input column to "token" using setInputCols(["token"]).
   - Set the output column to "cleanTokens" using setOutputCol("cleanTokens").
   - Use setCaseSensitive(False) if you want to ignore case when removing stopwords.
6. Create a Pipeline with the stages [document_assembler, tokenizer, stopwords_cleaner].
7. Fit the pipeline using the fit() function.
8. Use the transform() function to apply the pipeline to your data and get the cleaned tokens.
9. Use the select() function to select the "cleanTokens.result" column.
10. Use the show(truncate=False) function to display the cleaned tokens without truncating the output.
Output:
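The code for these steps is not reproduced here. A minimal sketch of them, assuming a session already created with `sparknlp.start()`, might look like the following; `remove_stop_words` and `SAMPLE_TEXT` are hypothetical names used only for illustration, and the sample sentence is reconstructed from the token lists shown in this article.

```python
# Assumption: sentence reconstructed from the token lists shown in this article.
SAMPLE_TEXT = (
    "Adil Naib, is one of the authors of GeeksForGeeks, has published many "
    "articles on topics like Data Science and Machine Learning."
)


def remove_stop_words(spark, text):
    """Return a DataFrame whose single column holds the tokens left after cleaning."""
    # Imports live inside the function so nothing touches Spark until it is called.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, StopWordsCleaner

    # One-row DataFrame with a single "text" column.
    data = spark.createDataFrame([[text]]).toDF("text")

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    stopwords_cleaner = (
        StopWordsCleaner()
        .setInputCols(["token"])
        .setOutputCol("cleanTokens")
        .setCaseSensitive(False)  # treat "The" and "the" alike
    )

    pipeline = Pipeline(stages=[document_assembler, tokenizer, stopwords_cleaner])
    model = pipeline.fit(data)
    return model.transform(data).select("cleanTokens.result")
```

Calling `remove_stop_words(sparknlp.start(), SAMPLE_TEXT).show(truncate=False)` should print a single row of cleaned tokens; the result reported by the article follows.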
[Adil, Naib, ,, one, authors, GeeksForGeeks, ,, published, many, articles, topics, like, Data, Science, Machine, Learning, .]
Tokenization breaks text down into individual units called tokens, typically words and punctuation marks. Effective tokenization is important for tasks such as text classification, sentiment analysis, and machine translation. SparkNLP provides the `Tokenizer` annotator, which tokenizes the whole text.
We will implement tokenization using the following steps:
1. Create a DocumentAssembler. Input: "text" column, Output: "document" column.
2. Create a Tokenizer. Input: "document" column, Output: "token" column.
3. Combine the DocumentAssembler and Tokenizer into a sequential process (a Pipeline).
4. Display the "token.result" column without truncation.
Output:
[Adil, Naib, ,, is, one, of, the, authors, of, GeeksForGeeks, ,, has, published, many, articles, on, topics, like, Data, Science, and, Machine, Learning, .]
Chunking, also known as shallow parsing, groups words into chunks (such as noun phrases) based on their part-of-speech tags and grammatical structure. These chunks help capture meaning that individual tokens miss, and chunking is commonly used in tasks such as text summarization and sentiment analysis. SparkNLP offers a `Chunker` annotator that enables us to chunk sentences.
To implement chunking, we will follow these steps:
1. Create a DocumentAssembler, using setInputCol() and setOutputCol() to specify input and output columns.
2. Create a SentenceDetector, using setInputCols() and setOutputCol() to define input and output columns.
3. Create a Tokenizer, configured with setInputCols() and setOutputCol().
4. Load the pre-trained PerceptronModel and set the input and output columns.
5. Combine all stages (DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, and Chunker) into a pipeline.
6. Fit the pipeline using the fit() function.
7. Apply it to the data using the transform() function.
8. Display the results using select() and show().
Output:
[Adil Naib, GeeksForGeeks, Data Science, Machine Learning]
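The chunking steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `extract_chunks` and `CHUNK_PATTERNS` are hypothetical names, and the pattern `<NNP>+` (one or more consecutive proper nouns) is a guess at a rule that would yield proper-noun chunks like those shown, since the article's own Chunker configuration is not shown.

```python
# Assumption: illustrative POS-tag pattern; the article's own pattern is not shown.
CHUNK_PATTERNS = ["<NNP>+"]


def extract_chunks(spark, text, patterns=CHUNK_PATTERNS):
    """Return a DataFrame whose single column holds the chunks found in `text`."""
    # Imports live inside the function so nothing touches Spark until it is called.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Chunker,
        PerceptronModel,
        SentenceDetector,
        Tokenizer,
    )

    # One-row DataFrame with a single "text" column.
    data = spark.createDataFrame([[text]]).toDF("text")

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    # Pretrained part-of-speech tagger; downloaded on first use.
    pos_tagger = (
        PerceptronModel.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("pos")
    )
    # The Chunker matches the patterns against the sequence of POS tags.
    chunker = (
        Chunker()
        .setInputCols(["sentence", "pos"])
        .setOutputCol("chunk")
        .setRegexParsers(list(patterns))
    )

    pipeline = Pipeline(
        stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, chunker]
    )
    model = pipeline.fit(data)
    return model.transform(data).select("chunk.result")
```

With a session from `sparknlp.start()`, `extract_chunks(spark, text).show(truncate=False)` displays the chunks; other patterns over POS tags can be passed through `patterns` to capture different phrase types.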