Natural Language Processing, or NLP for short, is a branch of Artificial Intelligence that enables machines to comprehend, process, and manipulate human languages. Advances in NLP have narrowed the gap between humans and machines, paving the way for technologies such as machine translation, voice assistants like Siri, customer service chatbots, and many more.

In this article, I’ll walk you through the fundamentals of text analysis using the powerful NLP library, Gensim.

Basics of Natural Language Processing
Introduction to Gensim
Hands-on with Gensim

Basics of Natural Language Processing

Natural Language Processing is all about handling natural languages, which can be text, audio, and video. This article will focus on understanding how to work with text data and discuss the building blocks of text data.

Token: A token is a string with a known meaning, and a token may be a word, number or just characters like punctuation. “Hello”, “123”, and “-” are some examples of tokens.

Sentence: A sentence is a group of tokens that is complete in meaning. “The weather looks good” is an example of a sentence, and the tokens of the sentence are [“The”, “weather”, “looks”, “good”].

Paragraph: A paragraph is a collection of sentences or phrases, and a sentence can alternatively be viewed as a token of a paragraph.

Document: A document might be a sentence, a paragraph, or a set of paragraphs. A text message sent to an individual is an example of a document.

Corpus: A corpus is a collection of documents, typically a large one. In Gensim, a corpus is often stored in Bag-of-Words form, where each document is recorded as the ids of its words together with their frequency counts. An example of a corpus is a collection of emails or text messages sent to a particular person.
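
To make these building blocks concrete, here is a minimal sketch in plain Python. The regular expressions are naive stand-ins for a real tokenizer and sentence splitter; they are my own assumption for illustration and are not part of Gensim.

```python
import re

# A document: here, a short text message made of two sentences.
document = "The weather looks good. Let us go out!"

# Naive sentence split: break after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", document)

# Naive tokenizer: runs of word characters, or single punctuation marks.
tokens_per_sentence = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

# A corpus: a collection of documents.
corpus = [document, "Hello, 123 - bye."]

print(sentences)
print(tokens_per_sentence[0])
```

Note that punctuation marks come out as tokens of their own, which matches the definition of a token given above.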

Introduction to Gensim

Gensim is a well-known open-source Python library used in NLP and topic modelling. Its ability to handle vast quantities of text data by streaming documents rather than loading them all into memory, and its speed in training vector embeddings, set it apart from many other NLP libraries. Moreover, Gensim provides popular topic modelling algorithms such as Latent Dirichlet Allocation (LDA), making it the go-to library for many users.

Hands-on with Gensim

Setting up Gensim is a pretty easy task. You can install it either with pip (pip install --upgrade gensim) or, inside a Conda environment, with conda (conda install -c conda-forge gensim).

Creating a Dictionary

We can use Gensim to generate dictionaries from a list of sentences or from text files. First, let’s look at making a dictionary out of a list of sentences.

You can see from the output that each token in the dictionary is assigned to a unique id.

Now, let’s make a dictionary with tokens from a text file. First, we’ll preprocess the file using Gensim’s simple_preprocess() function, which lowercases the text and splits it into a list of tokens.

We have now successfully created a dictionary from the text file.

We can also update an existing dictionary with tokens from a new document.

Creating a Bag-of-Words

We can use the dictionary’s doc2bow method to generate a Bag of Words from a tokenized document. For each document, doc2bow returns a list of tuples containing each token’s unique id and the number of times that token occurs in the document.

Saving and Loading a Gensim Dictionary and BOW

We can save both the dictionary and the BOW corpus to disk and load them back whenever we want.

Creating TF-IDF

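“Term Frequency – Inverse Document Frequency” (TF-IDF) is a technique for measuring the importance of each word in a document by computing the word’s weight.
In the TF-IDF vector, the weight of a word grows with how often the word appears in that document and shrinks with the number of documents in the corpus that contain it. A word that appears in almost every document therefore receives a low weight, while a word concentrated in a few documents receives a high one.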

Creating Bigrams and Trigrams

Some words usually appear together in the text of a large document. When these words occur together, they may act as a single entity and have a completely different meaning than when they occur separately.

Take the phrase “Gateway of India” as an example. Its words have a completely different meaning when they occur together than when they occur separately. Such groups of words are called “N-grams”.
Bigrams are N-grams of two words, and trigrams are N-grams of three words.

We’ll create bigrams and trigrams for the “text8” dataset, which is available for download via the Gensim Downloader API. We’ll be using Gensim’s Phrases class for this purpose.
The trigram model is generated by applying the previously obtained bigram model to the dataset and passing the result to Phrases again.

Creating a Word2Vec model

A word embedding model represents each word of a text as a numeric vector.
Word2Vec is a word embedding algorithm, implemented in Gensim, that trains a shallow neural network to embed words in a lower-dimensional vector space. Gensim’s Word2Vec implementation supports both the Skip-gram model and the Continuous Bag of Words (CBOW) model.

Let us initially train the Word2Vec model for the first 1000 words of the “text8” dataset.

Looking up a word such as “social” in the trained model returns the word vector that the model has learned for it.

Using the most_similar function, we can get the words most similar to a given word, here “social”.

You can also save your Word2Vec model and load it back.

Gensim also has a feature that enables you to update an existing Word2Vec model. We can update the model by calling the build_vocab function with update=True on the new sentences, followed by the train function.

EndNotes

We’ve gone over several key NLP topics to help you become more acquainted with text data manipulation using Gensim and begin putting your NLP skills to use. I hope the above examples aid you in discovering the beauty of Natural Language Processing using Gensim.

yukthab

URL: https://www.analyticsvidhya.com/blog/2022/03/learn-basics-of-natural-language-processing-nlp-using-gensim-part-1/