Sentiment analysis is crucial for understanding the emotional tone behind text data, making it invaluable for applications such as customer feedback analysis, social media monitoring, and market research. In this article, we will explore how to perform sentiment analysis using CatBoost.
CatBoost is a gradient-boosting library developed by Yandex. Here we use it as a binary classifier on TF-IDF features extracted from movie reviews, walking through setup, data loading, feature extraction, training, and evaluation.
For this example, we will use the IMDb dataset, loaded through the Hugging Face datasets library. It contains 50,000 movie reviews labeled as positive or negative, split evenly into 25,000 training and 25,000 test reviews, which makes it well suited to binary sentiment analysis.
Install CatBoost, the Datasets library, and scikit-learn (used below for TF-IDF features and evaluation metrics) with the following commands:
pip install catboost
pip install datasets
pip install scikit-learn
First, we load the IMDb dataset using the Hugging Face datasets library and separate it into training and test sets. Here, train_data holds the reviews and labels used for training, while test_data holds the reviews and labels used for evaluation.
from datasets import load_dataset
# Load the IMDb dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']
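Before vectorizing, it can help to confirm that the classes are balanced. The sketch below uses a small hypothetical list of rows shaped like IMDb records (a 'text' string and an integer 'label', where 0 is negative and 1 is positive); the sample reviews and the label_counts helper are illustrative assumptions, not part of the dataset.

```python
from collections import Counter

# Hypothetical stand-in rows shaped like IMDb records: a 'text' string and
# an integer 'label' (0 = negative, 1 = positive).
sample_rows = [
    {"text": "A moving, beautifully acted film.", "label": 1},
    {"text": "Dull plot and wooden dialogue.", "label": 0},
    {"text": "I would happily watch it again.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]


def label_counts(rows):
    """Count how many rows carry each label."""
    return Counter(row["label"] for row in rows)


counts = label_counts(sample_rows)
print(counts)
```

On the real data, passing train_data['label'] to Counter should give the same kind of summary.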
Next, we use TfidfVectorizer from the sklearn.feature_extraction.text module to convert each review into a numerical feature vector based on the TF-IDF scheme. Setting max_features=5000 keeps only the 5,000 most frequent terms in the training corpus. We call fit_transform on the training text (train_data['text']) to learn the vocabulary and produce the TF-IDF features, and transform on the test text (test_data['text']) so that it is encoded with the same vocabulary. The labels are stored in y_train and y_test for model training and evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['text'])
X_test = vectorizer.transform(test_data['text'])
y_train = train_data['label']
y_test = test_data['label']
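To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the calculation with smoothed IDF and L2 normalization, which matches scikit-learn's default weighting formula. It is a simplification: it tokenizes by splitting on whitespace and ignores max_features, and the tfidf function name and the three sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) using raw counts, smoothed IDF, and L2 normalization."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["great movie great cast", "terrible movie", "great fun"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second document, "terrible" receives a higher weight than "movie" because it appears in fewer documents, which is the effect that lets rare, sentiment-bearing words stand out.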
Here, the code initializes a CatBoostClassifier with 1,000 boosting iterations, a learning rate of 0.1, and a tree depth of 6, with verbose=100 printing training progress every 100 iterations. It then fits the model to the TF-IDF transformed training data (X_train and y_train).
from catboost import CatBoostClassifier
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=100)
# Fit the model
model.fit(X_train, y_train)
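For intuition about what iterations and learning_rate control, the following toy sketch runs plain gradient boosting on log loss with depth-1 trees over binary word-presence features. It is not CatBoost's implementation: CatBoost grows symmetric trees of the requested depth and adds refinements such as ordered boosting. The function names, the three features, and the six-row dataset are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the binary feature whose two-leaf split best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        left = [r for row, r in zip(X, residuals) if row[j] == 0]
        right = [r for row, r in zip(X, residuals) if row[j] == 1]
        left_val = sum(left) / len(left) if left else 0.0
        right_val = sum(right) / len(right) if right else 0.0
        # Squared error of predicting each leaf's mean residual.
        error = (sum((r - left_val) ** 2 for r in left)
                 + sum((r - right_val) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, j, left_val, right_val)
    return best[1], best[2], best[3]


def boost(X, y, iterations=50, learning_rate=0.1):
    """Fit an additive model of stumps; each stump fits the current residuals."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(iterations):
        # Negative gradient of log loss: label minus predicted probability.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, left_val, right_val = fit_stump(X, residuals)
        # Store leaf values already scaled by the learning rate.
        stump = (j, learning_rate * left_val, learning_rate * right_val)
        stumps.append(stump)
        scores = [s + (stump[2] if row[j] else stump[1])
                  for row, s in zip(X, scores)]
    return stumps


def predict(stumps, row):
    """Sum the stump outputs and threshold the probability at 0.5."""
    score = sum(rv if row[j] else lv for j, lv, rv in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical features: [contains "great", contains "boring", contains "movie"]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0, 1, 0]

stumps = boost(X, y)
print([predict(stumps, row) for row in X])
```

Each iteration adds a small correction scaled by the learning rate, so a lower learning rate generally needs more iterations to reach the same fit.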
After training the model, we predict the sentiments on the test set and evaluate the model's performance.
from sklearn.metrics import accuracy_score, classification_report
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))
Output:
Accuracy: 0.8766
              precision    recall  f1-score   support

           0       0.89      0.86      0.88     12500
           1       0.87      0.89      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
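To clarify how the columns in this report are derived, the sketch below computes accuracy, precision, recall, F1, and support by hand for a binary problem. The binary_report helper and the eight sample labels are assumptions for illustration; they are not the IMDb predictions.

```python
def binary_report(y_true, y_pred):
    """Compute accuracy plus per-class precision, recall, F1, and support."""
    pairs = list(zip(y_true, y_pred))
    report = {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted cls
        fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted cls
        fn = sum(t == cls and p != cls for t, p in pairs)  # missed cls
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true examples of this class
        }
    return report


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

report = binary_report(y_true, y_pred)
print(report)
```

Precision asks how many predictions of a class were right, while recall asks how many true members of that class were found; F1 is their harmonic mean.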
In this run, CatBoost trained on TF-IDF features reaches about 87.7% accuracy on the 25,000-review IMDb test set, with precision, recall, and F1 scores between 0.86 and 0.89 for both classes. The results suggest that CatBoost is a practical choice for sentiment analysis once text has been converted to numerical features.