Sentiment Analysis using CatBoost

Last Updated : 24 May, 2024

Sentiment analysis identifies the emotional tone behind text data, which makes it useful for applications such as customer feedback analysis, social media monitoring, and market research. In this article, we will perform sentiment analysis on movie reviews using TF-IDF features and a CatBoost classifier.

Key Features of CatBoost

Handling Categorical Features: CatBoost natively supports categorical features, eliminating the need for manual one-hot encoding. Its encoding scheme is designed to limit target leakage, which reduces the risk of overfitting.
Robustness to Overfitting: The library uses ordered boosting and oblivious (symmetric) trees to mitigate overfitting, which helps with the high dimensionality and sparsity of text features.
Ease of Use: CatBoost offers a scikit-learn-style fit/predict API and default parameters that often work well without extensive tuning, which simplifies building sentiment analysis models.
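
To make the categorical-feature point more concrete, here is a simplified sketch of ordered target statistics, the idea behind CatBoost's categorical encoding: each row's category is replaced by a smoothed average of the targets seen in earlier rows only, so a row never uses its own label. The function name ordered_target_stats, the prior of 0.5, and the toy data are assumptions made for illustration. CatBoost itself works over random permutations of the data and has more options than this sketch covers.

```python
def ordered_target_stats(categories, targets, prior=0.5):
    """Encode each category from the targets of earlier rows only.

    Simplified illustration: (sum of earlier targets + prior) / (earlier count + 1).
    """
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0)
        seen_count = counts.get(cat, 0)
        # The current row's own target is NOT used for its own encoding.
        encoded.append((seen_sum + prior) / (seen_count + 1))
        # Only now is the current row added to the running statistics.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


# Hypothetical toy data: a colour category and a binary target.
cats = ["red", "blue", "red", "red", "blue"]
ys = [1, 0, 1, 0, 1]
encoded = ordered_target_stats(cats, ys)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the running target average for that category.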

Why Use CatBoost for Sentiment Analysis?

CatBoost, a gradient-boosting library developed by Yandex, offers an efficient and accurate way to perform sentiment analysis. Its main benefits for this task are:

Ease of Use: Once the text has been vectorized, CatBoost trains directly on the numerical features. As a tree-based model, it needs no feature scaling or other preprocessing.
Accuracy: Ordered boosting and symmetric trees help the model generalize well; the example in this article reaches about 88% accuracy on IMDb.
Efficiency: CatBoost is optimized for fast training and prediction, making it suitable for large datasets.

Implementing Sentiment Analysis with CatBoost

For this example, we will use the IMDb dataset from the datasets library, which contains 50,000 movie reviews labeled as positive or negative. This dataset is readily available and well-suited for sentiment analysis.

Step 1: Install Necessary Libraries

Install the CatBoost, Datasets, and scikit-learn libraries using the following commands:

pip install catboost
pip install datasets
pip install scikit-learn

Step 2: Load Dataset

First, we load the IMDb dataset using the Hugging Face datasets library and separate it into training and test sets. Specifically, train_data contains the reviews and labels for training, while test_data contains the reviews and labels for evaluation.

from datasets import load_dataset

# Load the IMDb dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

Step 3: Text Preprocessing using TF-IDF

In the following code, we use TfidfVectorizer from the sklearn.feature_extraction.text module to convert the reviews into numerical TF-IDF feature vectors, keeping only the 5000 most frequent terms (max_features=5000). The fit_transform method learns the vocabulary and IDF weights from the training text (train_data['text']) and transforms it. The transform method then applies that same vocabulary to the test text (test_data['text']), so no information from the test set leaks into the features. The labels for the training and test sets are stored in y_train and y_test for model training and evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['text'])
X_test = vectorizer.transform(test_data['text'])

y_train = train_data['label']
y_test = test_data['label']
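
To show what the vectorizer computes, the following is a minimal pure-Python sketch of the weighting that scikit-learn documents as the TfidfVectorizer default: raw term counts multiplied by a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The helper names fit_idf and transform_doc, the whitespace tokenizer, and the three toy reviews are assumptions for illustration. The real vectorizer uses its own token pattern and returns a sparse matrix.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform_doc(doc, idf):
    """Turn one document into an L2-normalized TF-IDF mapping.

    Terms that were not seen during fitting are ignored, which mirrors
    calling transform (not fit_transform) on test data.
    """
    counts = Counter(t for t in doc.lower().split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical toy corpus standing in for the training reviews.
train_docs = ["great movie", "bad movie", "great great acting"]
idf = fit_idf(train_docs)

# A "test" review containing a word the vocabulary has never seen.
vec = transform_doc("great unseen movie", idf)
print(idf)
print(vec)
```

Rare terms such as "bad" receive a higher IDF than common ones such as "movie", and the unseen word contributes nothing to the test vector.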

Step 4: Model Training

Here, we initialize a CatBoostClassifier with 1000 boosting iterations, a learning rate of 0.1, and a tree depth of 6, then fit the model to the TF-IDF transformed training data (X_train and y_train). Setting verbose=100 prints training progress every 100 iterations.

from catboost import CatBoostClassifier

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=100)

# Fit the model
model.fit(X_train, y_train)

Step 5: Model Evaluation

After training the model, we predict the sentiments on the test set and evaluate the model's performance.

from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

Output:

Accuracy: 0.8766
              precision    recall  f1-score   support

           0       0.89      0.86      0.88     12500
           1       0.87      0.89      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
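
The columns of this report follow directly from counts of correct and incorrect predictions. The sketch below recomputes precision, recall, and F1 in plain Python for a small made-up list of labels (these are not the IMDb predictions). The helper name per_class_metrics is an assumption for illustration; in practice, classification_report does this work.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 0 = negative review, 1 = positive review.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0, 1]

metrics = {label: per_class_metrics(y_true_toy, y_pred_toy, label) for label in (0, 1)}
toy_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
# The macro average is the unweighted mean of the per-class scores.
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)

for label, (precision, recall, f1) in metrics.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={toy_accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Because both IMDb test classes have the same support (12,500 reviews each), the macro and weighted averages in the report above coincide.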

Conclusion

Trained on TF-IDF features from the IMDb dataset, CatBoost reaches about 88% accuracy (0.8766), with precision, recall, and F1-scores between 0.86 and 0.89 for both classes. This shows that CatBoost can handle sparse text features effectively for sentiment analysis tasks.

URL: https://www.geeksforgeeks.org/machine-learning/sentiment-analysis-using-catboost/