URL: https://www.analyticsvidhya.com/blog/2022/03/learn-basics-of-natural-language-processing-nlp-using-gensim-part-1/

Learn Basics of Natural Language Processing (NLP) using Gensim: Part 1



yukthab Last Updated : 11 Mar, 2022

Natural Language Processing, or NLP for short, is a branch of Artificial Intelligence that allows machines to comprehend, process, and manipulate human languages. Breakthroughs in NLP have narrowed the gap between humans and machines, paving the way for technologies such as language translators, voice assistants like Siri, customer service chatbots, and many more.

In this article, I’ll walk you through the fundamentals of text analysis using the powerful NLP library, Gensim.

Table of Contents

  1. Basics of Natural Language Processing
  2. Introduction to Gensim
  3. Hands-on with Gensim

Basics of Natural Language Processing

Natural Language Processing is all about handling natural languages, which can be text, audio, and video. This article will focus on understanding how to work with text data and discuss the building blocks of text data.

Token: A token is a string with a known meaning, and a token may be a word, number or just characters like punctuation. “Hello”, “123”, and “-” are some examples of tokens.

Sentence: A sentence is a group of tokens that is complete in meaning. “The weather looks good” is an example of a sentence, and the tokens of the sentence are [“The”, “weather”, “looks”, “good”].

Paragraph: A paragraph is a collection of sentences or phrases, and a sentence can alternatively be viewed as a token of a paragraph.

Documents: A document might be a sentence, a paragraph, or a set of paragraphs. A text message sent to an individual is an example of a document.

Corpus: A corpus is a collection of documents, typically a large one. In Gensim, a corpus is usually stored in Bag-of-Words form, where each document is represented by the id and frequency count of every word it contains. An example of a corpus is a collection of emails or text messages sent to a particular person.
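
To make these building blocks concrete, here is a minimal sketch in plain Python (no Gensim yet). The two example documents and the naive whitespace tokenization are my own assumptions, chosen only for illustration.

```python
from collections import Counter

# A tiny corpus: each document is one short text message (made-up examples).
documents = [
    "The weather looks good",
    "The weather looks bad today",
]

# Split each document into lowercase tokens (naive whitespace tokenization).
tokenized = [doc.lower().split() for doc in documents]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for tokens in tokenized:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Represent each document as sorted (token id, count) pairs.
bow_corpus = [sorted(Counter(token2id[t] for t in tokens).items())
              for tokens in tokenized]

print(token2id)
print(bow_corpus)
```

Gensim automates exactly these steps, as the hands-on section shows.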

Introduction to Gensim

Gensim is a well-known open-source Python library used in NLP and Topic Modeling. Its ability to handle vast quantities of text data and its speed in training vector embeddings set it apart from the other NLP libraries. Moreover, Gensim provides popular topic modelling algorithms such as LDA, making it the go-to library for many users.

Hands-on with Gensim

Setting up Gensim is easy. You can install it either with pip (pip install gensim) or with conda (conda install -c conda-forge gensim).

Creating a Dictionary

We can use Gensim to generate dictionaries from a list of sentences and text files. First, let’s look at making a dictionary out of a list of sentences.

You can see from the output that each token in the dictionary is assigned a unique id.

Now, let’s make a dictionary with tokens from a text file. Initially, we’ll preprocess the file using Gensim’s simple_preprocess() function to retrieve the list of tokens from the file.

We have now successfully created a dictionary from the text file.

We can also update an existing dictionary with tokens from a new document.

Creating a Bag-of-Words

We can use the doc2bow method of the Gensim dictionary to generate a Bag of Words for each document. doc2bow returns a list of tuples, each containing a token’s unique id and its number of occurrences in the document.

Saving and Loading a Gensim Dictionary and BOW

We can save both our dictionary and BOW corpus to disk and load them back whenever we want.

Creating TF-IDF

“Term Frequency – Inverse Document Frequency” (TF-IDF) is a technique for measuring the importance of each word in a document by computing a weight for it.
In the TF-IDF vector, a word’s weight increases with its frequency in that document and decreases with the number of documents in the corpus that contain it, so words that appear in almost every document receive weights near zero.

Creating Bigrams and Trigrams

Some words usually appear together in the text of a large document. When these words occur together, they may act as a single entity and have a completely different meaning than when they occur separately.

Take the phrase “Gateway to India” as an example: the words mean something quite different together than they do separately. Such groups of words are called N-grams.
Bigrams are N-grams of two words, and trigrams are N-grams of three words.

We’ll create bigrams and trigrams for the “text8” dataset, which is available for download via the Gensim Downloader API. We’ll be using Gensim’s Phrases class for this purpose.
The trigram model is generated by passing the output of the previously obtained bigram model to Phrases again.

Creating a Word2Vec model

A word embedding model represents each word as a numeric vector.
Word2Vec is a word embedding algorithm, implemented in Gensim, that trains a shallow neural network to embed words in a lower-dimensional vector space. Gensim’s Word2Vec class supports both the Skip-gram and the Continuous Bag of Words (CBOW) architectures.

Let us initially train the Word2Vec model for the first 1000 words of the “text8” dataset.

The above output is the word vector of “Social” found through this model.

Using the most_similar function, we can get the words most similar to a given word, here “Social”.

You can also save your Word2Vec model and load it back.

Gensim also has a feature that enables you to update an existing Word2Vec model. We can update the model by calling the build_vocab function followed by the train function.

EndNotes

We’ve gone over several key NLP topics to help you become more acquainted with text data manipulation using Gensim and begin putting your NLP skills to use. I hope the above examples aid you in discovering the beauty of Natural Language Processing using Gensim.
