The TfidfVectorizer in scikit-learn is a powerful tool for converting text data into numerical features, making it essential for many Natural Language Processing (NLP) tasks. Once you have fitted and transformed your data with TfidfVectorizer, you might want to save the vectorizer for future use.
This guide shows how to save a fitted TfidfVectorizer with pickle or joblib and load it later to transform new text data.
TfidfVectorizer is a feature extraction class in the scikit-learn library that converts a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. This is a common step in NLP and text mining, because machine learning algorithms need numerical input rather than raw text.
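The TF-IDF score for a term t in a document d is calculated as:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
Where:
tf(t, d) is the number of times term t occurs in document d.
idf(t) is the inverse document frequency of term t. With scikit-learn's default settings (smooth_idf=True) it is computed as:
$$\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$$
Where:
n is the total number of documents.
df(t) is the number of documents that contain term t.
By default, scikit-learn then scales each document's vector to unit L2 norm.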
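As a rough worked example, the sketch below computes TF-IDF by hand for the four sample sentences used later in this guide, assuming scikit-learn's default settings (smoothed IDF, L2 normalization). The hand-tokenized `docs` list and the helper `tfidf_row` are illustrative assumptions, not part of scikit-learn.

```python
import math

# The four sample documents, lowercased and split into words by hand
# (an approximation of what the default tokenizer produces for them).
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: how many documents contain each term.
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Hypothetical helper: raw tf * idf per term, then L2-normalized."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


first_row = tfidf_row(docs[0])
print(vocab)
print([round(v, 8) for v in first_row])
```

The values for the first document should agree with the first row of the matrix that TfidfVectorizer prints in the output section of this guide.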
TF-IDF evaluates how important a word is to a document relative to the rest of the collection. A fitted TfidfVectorizer holds the learned vocabulary and IDF weights, so storing it lets you preprocess text in exactly the same way across different sessions or applications without refitting.
Import the necessary libraries. TfidfVectorizer from sklearn is used for transforming text data into TF-IDF features. pickle and joblib are used for saving and loading the vectorizer model.
from sklearn.feature_extraction.text import TfidfVectorizer
from joblib import dump, load
import pickle
Define a list of sample text documents. These documents will be used to fit the TfidfVectorizer.
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]
Create an instance of TfidfVectorizer and fit it to the sample documents. The fit_transform method learns the vocabulary and idf from the documents and returns the transformed TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
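To make concrete what fitting learns, the following sketch inspects the fitted state. It assumes the same four sample documents and relies on the fitted attributes `vocabulary_` and `idf_`, which are exactly what gets saved when the vectorizer is stored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# X is a sparse matrix with one row per document and one column per term.
print(X.shape)

# Mapping from each learned term to its column index.
print(vectorizer.vocabulary_)

# One IDF weight per column, in column order.
print(vectorizer.idf_)
```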
Save the fitted TfidfVectorizer to a file using pickle. This allows the vectorizer to be reused later without needing to refit it to the data.
with open('tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
Load the saved TfidfVectorizer from the file using pickle. This restores the vectorizer to its state when it was saved.
with open('tfidf_vectorizer.pkl', 'rb') as file:
    loaded_vectorizer_pickle = pickle.load(file)
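One way to confirm that the restored object carries the same learned state is to compare its vocabulary and IDF weights with the original. The sketch below is a minimal check under the assumption that an in-memory round trip (`pickle.dumps` / `pickle.loads`) behaves like the file-based version; the variable names are illustrative.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer().fit(documents)

# Serialize to bytes and back instead of going through a file.
restored = pickle.loads(pickle.dumps(vectorizer))

# The learned vocabulary and IDF weights should survive the round trip.
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
same_idf = np.array_equal(restored.idf_, vectorizer.idf_)
print(same_vocab, same_idf)
```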
Save the fitted TfidfVectorizer to a file using joblib. joblib is optimized for objects that contain large NumPy arrays, making it a good choice for saving scikit-learn models.
dump(vectorizer, 'tfidf_vectorizer.joblib')
Load the saved TfidfVectorizer from the file using joblib. This restores the vectorizer to its state when it was saved.
loaded_vectorizer_joblib = load('tfidf_vectorizer.joblib')
Define a list of new text documents. These documents will be transformed using the loaded vectorizers.
new_documents = [
    "This is a new document.",
    "This document is different from the others."
]
Transform the new text documents using the vectorizer loaded from the pickle file. This converts the new documents into TF-IDF features.
X_new_pickle = loaded_vectorizer_pickle.transform(new_documents)
Transform the new text documents using the vectorizer loaded from the joblib file. This also converts the new documents into TF-IDF features.
X_new_joblib = loaded_vectorizer_joblib.transform(new_documents)
Print the feature names and the transformed data. This allows you to see the features (terms) extracted by the TfidfVectorizer and the TF-IDF values for both the original and new documents.
print("Feature names:")
print(vectorizer.get_feature_names_out())
print("\nOriginal transformed data:")
print(X.toarray())
print("\nTransformed new data using loaded vectorizer from pickle:")
print(X_new_pickle.toarray())
print("\nTransformed new data using loaded vectorizer from joblib:")
print(X_new_joblib.toarray())
Output:
Feature names:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Original transformed data:
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.51184851 0. 0. 0.26710379 0.51184851 0.
0.26710379 0.51184851 0.26710379]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
Transformed new data using loaded vectorizer from pickle:
[[0. 0.65416415 0. 0.53482206 0. 0.
0. 0. 0.53482206]
[0. 0.57684669 0. 0.47160997 0. 0.
0.47160997 0. 0.47160997]]
Transformed new data using loaded vectorizer from joblib:
[[0. 0.65416415 0. 0.53482206 0. 0.
0. 0. 0.53482206]
[0. 0.57684669 0. 0.47160997 0. 0.
0.47160997 0. 0.47160997]]
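The output shows the nine feature names (the vocabulary learned from the original documents), the TF-IDF matrix for the four original documents, and the TF-IDF matrices for the two new documents produced by each loaded vectorizer. Words that are not in the fitted vocabulary, such as "new" and "different", are ignored during transformation, so the new matrices keep the same nine columns.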
Both the pickle and joblib methods successfully store and restore the TfidfVectorizer, allowing for consistent transformation of new data.
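The equivalence of the two methods can also be checked programmatically rather than by eye. The following sketch assumes the same sample data and writes to a temporary directory (an illustrative choice, so that no files are left behind); it compares the output of each restored vectorizer against the original.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
new_documents = [
    "This is a new document.",
    "This document is different from the others.",
]

vectorizer = TfidfVectorizer().fit(documents)
expected = vectorizer.transform(new_documents).toarray()

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "tfidf_vectorizer.pkl"
    joblib_path = Path(tmp) / "tfidf_vectorizer.joblib"

    # Save with both methods.
    with open(pkl_path, "wb") as f:
        pickle.dump(vectorizer, f)
    dump(vectorizer, str(joblib_path))

    # Load both copies back before the temporary directory is removed.
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)
    from_joblib = load(str(joblib_path))

# Each restored vectorizer should reproduce the original's output.
pickle_matches = np.allclose(from_pickle.transform(new_documents).toarray(), expected)
joblib_matches = np.allclose(from_joblib.transform(new_documents).toarray(), expected)
print(pickle_matches, joblib_matches)
```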
Storing a TfidfVectorizer for future use is a practical approach to ensure consistency in text data preprocessing. Whether you use pickle or joblib, the process is straightforward and can save time in your machine learning workflow.