Understanding the significance of a word in a text is crucial for analyzing and interpreting large volumes of data. This is where the term frequency-inverse document frequency (TF-IDF) technique in Natural Language Processing (NLP) comes into play. By overcoming the limitations of the traditional bag of words approach, TF-IDF enhances text classification and bolsters machine learning models’ ability to comprehend and analyze textual information effectively. This article will show you how to build a TF-IDF model from scratch in Python and how to compute it numerically.

Terminology: Key Terms Used in TF-IDF

Before diving into the calculations and code, it’s essential to understand the key terms:

t: term (word)
d: document (a sequence of words)
N: number of documents in the corpus
corpus: the complete set of documents

What is Term Frequency (TF)?

Term frequency (TF) measures how often a term occurs in a document. A term’s weight in a document is directly proportional to its frequency of occurrence. The TF formula is:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)
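As a minimal sketch of this definition, assuming TF is a term's count divided by the document's total token count (the convention the worked example later in this article follows), the calculation might look like the following. The function name term_frequency and the pre-tokenized input are illustrative choices, not part of any library.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical pre-tokenized document: "The sun in the sky is bright."
doc = ["the", "sun", "in", "the", "sky", "is", "bright"]
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 7 tokens
```

Because every count is divided by the same total, the TF values of a document always sum to 1.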

What is Document Frequency (DF)?

Document Frequency (DF) measures how widespread a term is across the corpus. Unlike TF, which counts the occurrences of a term within a single document, DF counts the number of documents that contain the term at least once. The DF formula is:

DF(t) = number of documents containing t
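A small sketch of how DF could be counted, assuming the documents are already tokenized. The helper name document_frequency and the toy corpus are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Return {term: number of documents that contain the term}."""
    df = Counter()
    for tokens in tokenized_docs:
        # set() ensures a term is counted once per document,
        # however many times it appears in that document.
        df.update(set(tokens))
    return dict(df)


# Hypothetical toy corpus of pre-tokenized documents.
docs = [
    ["the", "sky", "is", "blue"],
    ["the", "sun", "is", "bright", "today"],
    ["the", "sun", "in", "the", "sky", "is", "bright"],
]
df = document_frequency(docs)
print(df["the"], df["sky"], df["today"])
```

Note that "the" occurs twice in the third document but still contributes only one to its DF.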

What is Inverse Document Frequency (IDF)?

Inverse document frequency (IDF) measures how informative a term is. TF treats every term as equally important, whereas IDF scales up rare terms and scales down common ones (such as stop words). The IDF formula is:

IDF(t) = log(N / (DF(t) + 1))

where N is the total number of documents and DF(t) is the number of documents containing the term t.
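The sketch below assumes the variant IDF(t) = log(N / (DF(t) + 1)) with the natural logarithm, which is what the worked example in this article uses. Other sources and libraries use slightly different smoothing, so treat the function name and the sample document frequencies as illustrative only.

```python
import math


def inverse_document_frequency(df, n_docs):
    """Return {term: log(N / (DF + 1))} using the natural logarithm."""
    return {term: math.log(n_docs / (count + 1)) for term, count in df.items()}


# Hypothetical document frequencies for a corpus of N = 4 documents.
df = {"the": 4, "sun": 3, "sky": 2, "blue": 1}
idf = inverse_document_frequency(df, n_docs=4)
for term, value in idf.items():
    print(f"{term}: {value:.3f}")
```

With this variant, a term that appears in every document gets a slightly negative IDF, because N / (DF + 1) falls below 1.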

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It combines the importance of a term in a document (TF) with the term’s rarity across the corpus (IDF). The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Numerical Calculation of TF-IDF

Let’s break down the numerical calculation of TF-IDF for the given documents:

Documents:

“The sky is blue.”
“The sun is bright today.”
“The sun in the sky is bright.”
“We can see the shining sun, the bright sun.”
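Before counting anything, the documents have to be split into terms. The sketch below assumes a deliberately simple tokenizer (lowercase, alphabetic words only), which is enough to reproduce the word counts used in Step 1. The helper name tokenize is a hypothetical choice, and real pipelines usually need more careful handling of punctuation, numbers, and Unicode.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]
tokenized = [tokenize(doc) for doc in documents]
print([len(tokens) for tokens in tokenized])  # [4, 5, 7, 9]
```

These lengths are the denominators of the TF values in the tables that follow.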

Step 1: Calculate Term Frequency (TF)

Document 1: “The sky is blue.”

Term	Count	TF
the	1	1/4
sky	1	1/4
is	1	1/4
blue	1	1/4

Document 2: “The sun is bright today.”

Term	Count	TF
the	1	1/5
sun	1	1/5
is	1	1/5
bright	1	1/5
today	1	1/5

Document 3: “The sun in the sky is bright.”

Term	Count	TF
the	2	2/7
sun	1	1/7
in	1	1/7
sky	1	1/7
is	1	1/7
bright	1	1/7

Document 4: “We can see the shining sun, the bright sun.”

Term	Count	TF
we	1	1/9
can	1	1/9
see	1	1/9
the	2	2/9
shining	1	1/9
sun	2	2/9
bright	1	1/9

Step 2: Calculate Inverse Document Frequency (IDF)

Using N = 4:

Term	DF	IDF
the	4	log(4/(4+1)) = log(0.8) ≈ -0.223
sky	2	log(4/(2+1)) = log(1.333) ≈ 0.287
is	3	log(4/(3+1)) = log(1) = 0
blue	1	log(4/(1+1)) = log(2) ≈ 0.693
sun	3	log(4/(3+1)) = log(1) = 0
bright	3	log(4/(3+1)) = log(1) = 0
today	1	log(4/(1+1)) = log(2) ≈ 0.693
in	1	log(4/(1+1)) = log(2) ≈ 0.693
we	1	log(4/(1+1)) = log(2) ≈ 0.693
can	1	log(4/(1+1)) = log(2) ≈ 0.693
see	1	log(4/(1+1)) = log(2) ≈ 0.693
shining	1	log(4/(1+1)) = log(2) ≈ 0.693

Step 3: Calculate TF-IDF

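Now, let’s calculate the TF-IDF value of each term in each document by multiplying its TF from Step 1 by its IDF from Step 2. Because “the” occurs in all four documents, its IDF is negative, and so are its TF-IDF values.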

Document 1: “The sky is blue.”

Term	TF	IDF	TF-IDF
the	0.25	-0.223	0.25 * -0.223 ≈ -0.056
sky	0.25	0.287	0.25 * 0.287 ≈ 0.072
is	0.25	0	0.25 * 0 = 0
blue	0.25	0.693	0.25 * 0.693 ≈ 0.173

Document 2: “The sun is bright today.”

Term	TF	IDF	TF-IDF
the	0.2	-0.223	0.2 * -0.223 ≈ -0.045
sun	0.2	0	0.2 * 0 = 0
is	0.2	0	0.2 * 0 = 0
bright	0.2	0	0.2 * 0 = 0
today	0.2	0.693	0.2 * 0.693 ≈ 0.139

Document 3: “The sun in the sky is bright.”

Term	TF	IDF	TF-IDF
the	0.285	-0.223	0.285 * -0.223 ≈ -0.064
sun	0.142	0	0.142 * 0 = 0
in	0.142	0.693	0.142 * 0.693 ≈ 0.098
sky	0.142	0.287	0.142 * 0.287 ≈ 0.041
is	0.142	0	0.142 * 0 = 0
bright	0.142	0	0.142 * 0 = 0

Document 4: “We can see the shining sun, the bright sun.”

Term	TF	IDF	TF-IDF
we	0.111	0.693	0.111 * 0.693 ≈ 0.077
can	0.111	0.693	0.111 * 0.693 ≈ 0.077
see	0.111	0.693	0.111 * 0.693 ≈ 0.077
the	0.222	-0.223	0.222 * -0.223 ≈ -0.049
shining	0.111	0.693	0.111 * 0.693 ≈ 0.077
sun	0.222	0	0.222 * 0 = 0
bright	0.111	0	0.111 * 0 = 0
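The three steps above can be reproduced with a short from-scratch script. This is a sketch under the same assumptions as the hand calculation: a simple lowercase, letters-only tokenizer, TF as count divided by document length, and IDF = log(N / (DF + 1)) with the natural logarithm. The function name tf_idf is an illustrative choice.

```python
import math
import re
from collections import Counter

documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

# Lowercase and keep alphabetic words only (an assumed, simplistic tokenizer).
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
n_docs = len(tokenized)

# DF: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# IDF = log(N / (DF + 1)), natural logarithm.
idf = {term: math.log(n_docs / (count + 1)) for term, count in df.items()}


def tf_idf(tokens):
    """Return {term: TF * IDF} for one tokenized document."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf[term] for term, count in counts.items()}


scores = [tf_idf(tokens) for tokens in tokenized]
for term, score in scores[0].items():
    print(f"{term:<6}{score:.3f}")
```

The printed values for Document 1 should agree with the first table in Step 3. Values for other documents can differ from the tables in the last digit, because the tables work with TF and IDF values already shortened to three decimals.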

TF-IDF Implementation in Python Using an Inbuilt Dataset

Now let’s apply the TF-IDF calculation using the TfidfVectorizer from scikit-learn with an inbuilt dataset.

Step 1: Install Necessary Libraries

Ensure you have scikit-learn and pandas installed (pandas is used in Step 6 to display the matrix):

pip install scikit-learn pandas

Step 2: Import Libraries

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

Step 3: Load the Dataset

Fetch the 20 Newsgroups dataset:

newsgroups = fetch_20newsgroups(subset='train')

Step 4: Initialize TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

Step 5: Fit and Transform the Documents

Convert the text documents to a TF-IDF matrix:

tfidf_matrix = vectorizer.fit_transform(newsgroups.data)
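The object returned by fit_transform is a SciPy sparse matrix rather than a dense array. A quick way to inspect it is sketched below. To keep the snippet self-contained, it uses the four example sentences as a stand-in corpus, which is an assumption made for illustration; the same attributes apply to the newsgroups matrix.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in corpus so the snippet runs without downloading a dataset.
documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

print(issparse(tfidf_matrix))  # True when stored in a SciPy sparse format
print(tfidf_matrix.shape)      # (number of documents, vocabulary size)
print(tfidf_matrix.nnz)        # number of stored non-zero weights

rows, cols = tfidf_matrix.shape
print(f"density: {tfidf_matrix.nnz / (rows * cols):.2%}")
```

On a real corpus the density is typically far lower than on this toy example, which is why the sparse representation matters.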

Step 6: View the TF-IDF Matrix

Convert the matrix to a DataFrame for better readability:

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf.head()

Figure: TF-IDF Matrix

Conclusion

By using the 20 Newsgroups dataset and TfidfVectorizer, you can convert a large collection of text documents into a TF-IDF matrix. This matrix numerically represents the importance of each term in each document, facilitating various NLP tasks such as text classification, clustering, and more advanced text analysis. The TfidfVectorizer from scikit-learn provides an efficient and straightforward way to achieve this transformation.

Frequently Asked Questions

Q1. Why do we take the log of IDF?

Ans. The ratio N/DF can become very large for rare terms, especially in large corpora. Taking the log compresses that range, so IDF values stay manageable and rare terms do not completely dominate the score, while terms that appear in almost every document still receive a weight close to zero.
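To make the effect concrete, the sketch below compares the raw ratio N/DF with its natural log for a hypothetical corpus of one million documents. It uses the plain ratio without any smoothing term, purely for illustration.

```python
import math

n_docs = 1_000_000  # hypothetical corpus size

# (DF, raw ratio N/DF, log of the ratio) for terms of decreasing rarity.
rows = [
    (df, n_docs / df, math.log(n_docs / df))
    for df in (1, 100, 10_000, 1_000_000)
]
for df, ratio, log_ratio in rows:
    print(f"DF={df:>9,}  N/DF={ratio:>11,.0f}  log(N/DF)={log_ratio:6.2f}")
```

The raw ratio spans six orders of magnitude, while the logged values stay between 0 and roughly 14.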

Q2. Can TF-IDF be used for large datasets?

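Ans. Yes, TF-IDF can be used for large datasets. The resulting matrix has one row per document and one column per vocabulary term, so it should be kept in a sparse format (which TfidfVectorizer returns by default), and the vocabulary size may need to be capped, for example with max_features.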

Q3. What’s the limitation of TF-IDF?

Ans. TF-IDF does not account for word order or context. It treats each term independently, so it can miss the meaning of phrases and the relationships between words.

Q4. What are some applications of TF-IDF?

Ans. TF-IDF is used in various applications, including:
1. Search engines to rank documents based on relevance to a query
2. Text classification to identify the most significant words for categorizing documents
3. Clustering to group similar documents based on key terms
4. Text summarization to extract important sentences from a document
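As an illustration of the search use case, the sketch below ranks documents against a query by cosine similarity between TF-IDF vectors. The corpus and the query string "blue sky" are hypothetical examples, and this is only one simple way to score relevance, not how any particular search engine works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same TF-IDF space as the documents.
query_vector = vectorizer.transform(["blue sky"])

# One similarity score per document; higher means more relevant.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]
for index in ranking:
    print(f"{scores[index]:.3f}  {documents[index]}")
```

Documents that share no terms with the query receive a score of zero, so only the sentences mentioning "sky" or "blue" are ranked as relevant here.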

Janvi Kumari

Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.


URL: https://www.analyticsvidhya.com/blog/2024/07/tf-idf-matrix/

How Do You Convert Text Documents to a TF-IDF Matrix with tfidfvectorizer?
