
URL: https://www.analyticsvidhya.com/blog/2024/07/tf-idf-matrix/

Convert Text Documents to a TF-IDF Matrix with TfidfVectorizer



How Do You Convert Text Documents to a TF-IDF Matrix with TfidfVectorizer?

Janvi Kumari Last Updated : 25 Feb, 2025
5 min read

Understanding the significance of a word in a text is crucial for analyzing and interpreting large volumes of data. This is where the term frequency-inverse document frequency (TF-IDF) technique in Natural Language Processing (NLP) comes into play. By overcoming the limitations of the traditional bag-of-words approach, TF-IDF enhances text classification and bolsters machine learning models’ ability to analyze textual information effectively. This article walks through the TF-IDF calculation by hand on a small example and then shows how to build a TF-IDF matrix in Python with scikit-learn’s TfidfVectorizer.

Terminology: Key Terms Used in TF-IDF

Before diving into the calculations and code, it’s essential to understand the key terms:

  • t: term (word)
  • d: document (a sequence of words)
  • N: number of documents in the corpus
  • corpus: the full set of documents

What is Term Frequency (TF)?

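Term frequency (TF) measures how often a term occurs in a document. The more often a term appears in a document, the higher its weight in that document. To keep documents of different lengths comparable, the count is divided by the document’s length. The TF formula is:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)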
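As a minimal sketch of this definition, the snippet below computes TF for a single document. The helper names `tokenize` and `term_frequency` and the simple regex tokenizer are illustrative assumptions, not something prescribed by this article.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and treat each alphabetic run as a term.
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(text):
    # TF(t, d) = count of t in d / total number of terms in d
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


tf = term_frequency("The sun in the sky is bright.")
print(tf["the"])  # "the" accounts for 2 of the 7 terms
```

Because every count is divided by the same document length, the TF values of one document always sum to 1.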

What is Document Frequency (DF)?

Document frequency (DF) measures how common a term is across the corpus. Unlike TF, which counts the occurrences of a term within a single document, DF counts the number of documents that contain the term at least once. The DF formula is:

DF(t) = number of documents containing t
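A short sketch of the DF count, using the four example sentences from the numerical section of this article. The function name `document_frequency` and the regex tokenizer are assumptions made for illustration.

```python
import re


def tokenize(text):
    # Assumption: lowercase the text and treat each alphabetic run as a term.
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(corpus):
    # DF(t) = number of documents that contain t at least once
    df = {}
    for text in corpus:
        for term in set(tokenize(text)):  # set(): count each document only once
            df[term] = df.get(term, 0) + 1
    return df


corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]
df = document_frequency(corpus)
print(df["the"], df["sun"], df["sky"], df["blue"])
```

Note that "sun" occurs twice in the last sentence but still adds only 1 to its DF, because DF counts documents, not occurrences.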

What is Inverse Document Frequency (IDF)?

Inverse document frequency (IDF) measures how informative a term is. TF alone treats every term as equally important, whereas IDF scales up rare terms and scales down common ones (such as stop words). The IDF formula is:

IDF(t) = log(N / (DF(t) + 1))

where N is the total number of documents, DF(t) is the number of documents containing the term t, and log is the natural logarithm. The +1 in the denominator avoids division by zero for a term that appears in no document.
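The sketch below applies the IDF variant used in this article's worked example, ln(N / (DF + 1)), to a few hand-entered document frequencies. The function name `inverse_document_frequency` is an illustrative assumption.

```python
import math


def inverse_document_frequency(df, n_docs):
    # IDF(t) = ln(N / (DF(t) + 1)), the variant used in this article's example
    return {term: math.log(n_docs / (count + 1)) for term, count in df.items()}


# Document frequencies taken from the four-sentence example (N = 4).
df = {"the": 4, "sky": 2, "is": 3, "blue": 1}
idf = inverse_document_frequency(df, n_docs=4)

for term, value in idf.items():
    print(f"{term:>5}: {value:.3f}")
```

With this variant, a term that appears in every document gets a slightly negative IDF, and a term that appears in N - 1 documents gets exactly 0.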

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It combines the importance of a term in a document (TF) with the term’s rarity across the corpus (IDF). The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Numerical Calculation of TF-IDF

Let’s break down the numerical calculation of TF-IDF for the given documents:

Documents:

  1. “The sky is blue.”
  2. “The sun is bright today.”
  3. “The sun in the sky is bright.”
  4. “We can see the shining sun, the bright sun.”

Step 1: Calculate Term Frequency (TF)

Document 1: “The sky is blue.”

| Term | Count | TF  |
|------|-------|-----|
| the  | 1     | 1/4 |
| sky  | 1     | 1/4 |
| is   | 1     | 1/4 |
| blue | 1     | 1/4 |

Document 2: “The sun is bright today.”

| Term   | Count | TF  |
|--------|-------|-----|
| the    | 1     | 1/5 |
| sun    | 1     | 1/5 |
| is     | 1     | 1/5 |
| bright | 1     | 1/5 |
| today  | 1     | 1/5 |

Document 3: “The sun in the sky is bright.”

| Term   | Count | TF  |
|--------|-------|-----|
| the    | 2     | 2/7 |
| sun    | 1     | 1/7 |
| in     | 1     | 1/7 |
| sky    | 1     | 1/7 |
| is     | 1     | 1/7 |
| bright | 1     | 1/7 |

Document 4: “We can see the shining sun, the bright sun.”

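| Term    | Count | TF  |
|---------|-------|-----|
| we      | 1     | 1/9 |
| can     | 1     | 1/9 |
| see     | 1     | 1/9 |
| the     | 2     | 2/9 |
| shining | 1     | 1/9 |
| sun     | 2     | 2/9 |
| bright  | 1     | 1/9 |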

Step 2: Calculate Inverse Document Frequency (IDF)

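Using N = 4 and the natural logarithm:

| Term    | DF | IDF                                 |
|---------|----|-------------------------------------|
| the     | 4  | log(4/(4+1)) = log(0.8) ≈ -0.223    |
| sky     | 2  | log(4/(2+1)) = log(1.333) ≈ 0.287   |
| is      | 3  | log(4/(3+1)) = log(1) = 0           |
| blue    | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| sun     | 3  | log(4/(3+1)) = log(1) = 0           |
| bright  | 3  | log(4/(3+1)) = log(1) = 0           |
| today   | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| in      | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| we      | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| can     | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| see     | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |
| shining | 1  | log(4/(1+1)) = log(2) ≈ 0.693       |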

Step 3: Calculate TF-IDF

Now, let’s calculate the TF-IDF values for each term in each document.

Document 1: “The sky is blue.”

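| Term | TF   | IDF    | TF-IDF                  |
|------|------|--------|-------------------------|
| the  | 0.25 | -0.223 | 0.25 * -0.223 ≈ -0.056  |
| sky  | 0.25 | 0.287  | 0.25 * 0.287 ≈ 0.072    |
| is   | 0.25 | 0      | 0.25 * 0 = 0            |
| blue | 0.25 | 0.693  | 0.25 * 0.693 ≈ 0.173    |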

Document 2: “The sun is bright today.”

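| Term   | TF  | IDF    | TF-IDF                 |
|--------|-----|--------|------------------------|
| the    | 0.2 | -0.223 | 0.2 * -0.223 ≈ -0.045  |
| sun    | 0.2 | 0      | 0.2 * 0 = 0            |
| is     | 0.2 | 0      | 0.2 * 0 = 0            |
| bright | 0.2 | 0      | 0.2 * 0 = 0            |
| today  | 0.2 | 0.693  | 0.2 * 0.693 ≈ 0.139    |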

Document 3: “The sun in the sky is bright.”

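| Term   | TF    | IDF    | TF-IDF                   |
|--------|-------|--------|--------------------------|
| the    | 0.285 | -0.223 | 0.285 * -0.223 ≈ -0.064  |
| sun    | 0.142 | 0      | 0.142 * 0 = 0            |
| in     | 0.142 | 0.693  | 0.142 * 0.693 ≈ 0.098    |
| sky    | 0.142 | 0.287  | 0.142 * 0.287 ≈ 0.041    |
| is     | 0.142 | 0      | 0.142 * 0 = 0            |
| bright | 0.142 | 0      | 0.142 * 0 = 0            |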

Document 4: “We can see the shining sun, the bright sun.”

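| Term    | TF    | IDF    | TF-IDF                   |
|---------|-------|--------|--------------------------|
| we      | 0.111 | 0.693  | 0.111 * 0.693 ≈ 0.077    |
| can     | 0.111 | 0.693  | 0.111 * 0.693 ≈ 0.077    |
| see     | 0.111 | 0.693  | 0.111 * 0.693 ≈ 0.077    |
| the     | 0.222 | -0.223 | 0.222 * -0.223 ≈ -0.049  |
| shining | 0.111 | 0.693  | 0.111 * 0.693 ≈ 0.077    |
| sun     | 0.222 | 0      | 0.222 * 0 = 0            |
| bright  | 0.111 | 0      | 0.111 * 0 = 0            |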
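The three steps above can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: the regex tokenizer and the variable names are illustrative, and the IDF variant is the ln(N / (DF + 1)) form used in the tables. Because the code keeps full precision, a few values may differ from the tables in the third decimal place.

```python
import math
import re
from collections import Counter

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]


def tokenize(text):
    # Assumption: lowercase the text and treat each alphabetic run as a term.
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Step 2 inputs: DF(t) = number of documents containing t
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / (count + 1)) for term, count in df.items()}

# Steps 1 and 3: TF(t, d) * IDF(t) for every term of every document
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})

for term, score in tfidf[0].items():
    print(f"{term:>5}: {score:.3f}")
```

The printed scores for Document 1 match the first table in Step 3: "blue" is the most distinctive term, while "is" scores 0 and "the" is slightly negative.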

TF-IDF Implementation in Python Using an Inbuilt Dataset

Now let’s apply the TF-IDF calculation using the TfidfVectorizer from scikit-learn with an inbuilt dataset.

Step 1: Install Necessary Libraries

Ensure you have scikit-learn and pandas installed:

pip install scikit-learn pandas

Step 2: Import Libraries

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

Step 3: Load the Dataset

Fetch the training split of the 20 Newsgroups dataset (it is downloaded and cached on first use):

newsgroups = fetch_20newsgroups(subset='train')

Step 4: Initialize TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
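Here `stop_words='english'` drops scikit-learn's built-in list of common English words, and `max_features=1000` keeps only the 1000 terms with the highest total count across the corpus. Note also that TfidfVectorizer's defaults do not use the same formula as the hand calculation earlier in this article: with `smooth_idf=True` it computes ln((1 + N) / (1 + DF)) + 1, uses raw counts as term frequency, and scales each row to unit length with `norm='l2'`. The sketch below checks this on the four example sentences (default settings, no stop-word removal, so that "the" stays in the vocabulary).

```python
import math

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

# Defaults: smooth_idf=True, norm="l2", raw counts as term frequency.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

n_docs = len(corpus)
df_the = 4  # "the" appears in all four documents

idf_article = math.log(n_docs / (df_the + 1))           # formula used in this article
idf_smooth = math.log((1 + n_docs) / (1 + df_the)) + 1  # scikit-learn's smoothed IDF
idf_fitted = vectorizer.idf_[vectorizer.vocabulary_["the"]]

print(f"article formula: {idf_article:.3f}")
print(f"smoothed formula: {idf_smooth:.3f}")
print(f"vectorizer.idf_:  {idf_fitted:.3f}")

# With norm="l2", every document vector has Euclidean length 1.
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(row_norms)
```

So the numbers produced by TfidfVectorizer are not expected to match the tables above, even though both follow the same TF × IDF idea.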

Step 5: Fit and Transform the Documents

Convert the text documents to a TF-IDF matrix:

tfidf_matrix = vectorizer.fit_transform(newsgroups.data)
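`fit_transform` learns the vocabulary and IDF weights and returns a SciPy sparse matrix with one row per document and one column per term. The sketch below inspects such a matrix; to stay small and runnable offline it uses the four example sentences from earlier in this article instead of 20 Newsgroups, which is an assumption made for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

n_docs, n_terms = tfidf_matrix.shape
# nnz is the number of stored (non-zero) entries.
density = tfidf_matrix.nnz / (n_docs * n_terms)

print(issparse(tfidf_matrix))
print(tfidf_matrix.shape)
print(f"{density:.1%} of the entries are non-zero")
```

On a real corpus the density is typically far lower, which is why the result is kept sparse until it is explicitly converted with `toarray()`.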

Step 6: View the TF-IDF Matrix

Convert the matrix to a DataFrame for better readability:

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf.head()

Figure: TF-IDF Matrix

Conclusion

By using the 20 Newsgroups dataset and TfidfVectorizer, you can convert a large collection of text documents into a TF-IDF matrix. This matrix numerically represents the importance of each term in each document, facilitating various NLP tasks such as text classification, clustering, and more advanced text analysis. The TfidfVectorizer from scikit-learn provides an efficient and straightforward way to achieve this transformation.

Frequently Asked Questions

Q1. Why do we take the log of IDF?

Ans. Without the logarithm, the ratio N/DF(t) grows very large for rare terms, especially in large corpora. Taking the log dampens this growth and keeps IDF values in a manageable range, so rare terms are still weighted higher than common ones but do not dominate the score.
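To illustrate the damping effect, the sketch below compares the raw ratio with its logarithm for a hypothetical corpus of one million documents. The corpus size, the chosen document frequencies, and the helper name `raw_ratio` are assumptions made for illustration.

```python
import math


def raw_ratio(n_docs, doc_freq):
    # Un-logged inverse document frequency: N / (DF + 1)
    return n_docs / (doc_freq + 1)


n_docs = 1_000_000  # hypothetical corpus size
for doc_freq in (1, 1_000, 500_000):
    ratio = raw_ratio(n_docs, doc_freq)
    print(f"DF={doc_freq:>7}  ratio={ratio:>10.1f}  log={math.log(ratio):.2f}")
```

The raw ratio spans several orders of magnitude across these terms, while the logged values stay within a range of roughly 0 to 14.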

Q2. Can TF-IDF be used for large datasets?

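Ans. Yes, TF-IDF can be used for large datasets. TfidfVectorizer returns a sparse matrix that stores only non-zero entries, so memory use depends on the number of distinct terms per document rather than on the full vocabulary size. Converting the result to a dense array, or keeping a very large vocabulary, is what usually demands substantial computational resources; limiting the vocabulary (for example with max_features) helps.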

Q3. What’s the limitation of TF-IDF?

Ans. A key limitation of TF-IDF is that it doesn’t account for word order or context. It treats each term independently and can therefore miss the meaning of phrases or the relationships between words.
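A small sketch of this limitation: the two sentences below mean different things but contain the same terms with the same counts, so any weighting built only on term counts, including TF-IDF, gives them identical vectors. The example sentences and the helper name `bag_of_words` are illustrative.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Term counts only; the position of each term is discarded.
    return Counter(re.findall(r"[a-z]+", text.lower()))


a = bag_of_words("The dog bites the man.")
b = bag_of_words("The man bites the dog.")

print(a == b)  # True: the two sentences are indistinguishable by term counts
```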

Q4. What are some applications of TF-IDF?

Ans. TF-IDF is used in various applications, including:
1. Search engines to rank documents based on relevance to a query
2. Text classification to identify the most significant words for categorizing documents
3. Clustering to group similar documents based on key terms
4. Text summarization to extract important sentences from a document
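As a sketch of the first application, the snippet below ranks the four example sentences from earlier in this article against a query by cosine similarity of their TF-IDF vectors. The query text and variable names are assumptions for illustration; real search engines combine many more signals.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Reuse the fitted vocabulary and IDF weights for the query.
query_vec = vectorizer.transform(["bright sun"])

scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
ranking = scores.argsort()[::-1]  # document indices, best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The first sentence scores 0 because it shares no term with the query, and the short second sentence ranks first because the query terms make up a larger share of its vector.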

Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.
