Word2Vec with Gensim

Last Updated: 12 Jun, 2026

Word2Vec is a technique for learning word embeddings, which are dense vector representations of words. It is based on the principle that words that appear in similar contexts tend to have similar meanings. For example, the words "king" and "queen" often appear in similar contexts, so Word2Vec tends to represent them as vectors that are close to each other in the vector space. Word2Vec offers two model architectures:

Continuous Bag of Words (CBOW): Predicts the target word (center word) from its context words (surrounding words).
Skip-gram: Predicts the context words from the target word (center word).
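To make the difference between the two architectures concrete, here is a simplified sketch of the training examples each one would draw from a single sentence. The helper names (context_window, cbow_examples, skipgram_examples) are made up for illustration, and the sketch ignores details of real implementations such as subsampling and negative sampling.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    # CBOW: one example per position -> (all context words, center word)
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    # Skip-gram: one example per (center word, single context word) pair
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


sentence = ["the", "king", "rules", "the", "land"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['the', 'rules'], 'king')
print(skipgram[:3])  # [('the', 'king'), ('king', 'the'), ('king', 'rules')]
```

CBOW produces one example per position, while skip-gram produces one example per (center, context) pair, so skip-gram sees more training examples from the same text.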

Gensim is an open-source Python library specifically designed for unsupervised topic modelling and NLP tasks. It excels at handling large text corpora and includes several efficient algorithms, such as Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI) and Word2Vec.

Implementation

Step 1: Installing and Setting Up Gensim for Word2Vec

Before starting, make sure you have Python and the necessary libraries installed. To install Gensim with pip, run: pip install gensim

Step 2: Preprocessing Data for Word2Vec Models

Word2Vec requires large datasets of text to be effective, and preprocessing is a crucial step to ensure the model performs well. Preprocessing typically involves the following steps:

Tokenization: Splitting sentences into individual words.
Lowercasing: Converting all words to lowercase to avoid treating "Apple" and "apple" as different words.
Removing Stopwords: Filtering out common words like "the", "is", and "in".
Lemmatization/Stemming: Reducing words to their base or root forms.

Here is an example of how to preprocess a text dataset:
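The listed steps can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stopword set is a small hand-picked assumption, and naive_stem is a hypothetical stand-in for a real stemmer or lemmatizer from a library such as NLTK or spaCy.

```python
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "on"}


def naive_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lowercase, then tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [naive_stem(t) for t in tokens]              # reduce to rough base forms


raw_sentences = [
    "The king is ruling the land.",
    "Queens ruled in the old kingdoms.",
]
preprocessed = [preprocess(s) for s in raw_sentences]
print(preprocessed)
```

The result is a list of token lists, which is the input shape Gensim's Word2Vec expects.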

Output:

[Image: Downloading and Unzipping]

Step 3: Training Word2Vec Models with Gensim

Now that your data is preprocessed, you can start training your Word2Vec model using Gensim. The basic syntax for training a Word2Vec model is as follows:

Output:

[Image: Word2Vec with Gensim]

Step 4: Evaluate the Word2Vec Model

You can evaluate the model by checking word similarities and performing analogy tasks:

Output:

[Image: Similarity Results]

Step 5: Visualize Word Embeddings

To visualize the word embeddings, we will reduce their dimensionality using PCA:

Output:

[Image: word embeddings visualization]

Step 6: Fine-Tuning the Word2Vec Model with Gensim

Fine-tuning can be done in various ways, including adjusting hyperparameters and re-training the model. Here's how you can implement fine-tuning in this context:

Adjust Hyperparameters: Change parameters such as vector_size, window, and min_count based on your understanding of the data and requirements.
Use More Data: If available, you can add more sentences to improve the quality of the learned embeddings.
Training on the Same or New Data: Re-train the model using the same or an expanded dataset.

Step 7: Evaluate the Model

Now we will evaluate the fine-tuned model by checking the similarity score between related terms. Note that the score may still be low because the model was trained on a very small corpus.

Output:

[Image: Evaluation of model]

Explanation of the Fine-Tuning Steps:

Build Vocabulary: We call model.build_vocab(additional_preprocessed, update=True) to update the existing vocabulary of the Word2Vec model with new words from additional training sentences.
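Continue Training: The model is then trained again on the additional sentences with model.train(), which updates the word vectors based on the new data. When calling model.train() directly, the total_examples and epochs arguments must be passed explicitly.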
Save Fine-Tuned Model: After fine-tuning, the model is saved as "word2vec_fine_tuned.model".
Evaluation and Visualization: Similarity checks, analogy tasks, and visualization steps remain unchanged but will now reflect the adjustments made through fine-tuning.

Step 8: Visualize Word Embeddings

Output:

[Image: Word embeddings visualization]

Applications

Text classification: Word2Vec embeddings can be used as input features for machine learning models.
Sentiment analysis: Word embeddings help models capture the underlying sentiment in text.
Recommendation systems: Word2Vec can be used to recommend similar items by finding related words or phrases.
Document similarity: Embeddings allow for the comparison of documents by calculating vector distances between them.


URL: https://www.geeksforgeeks.org/nlp/word2vec-with-gensim/