Complement Naive Bayes (CNB) is a variant of the Naive Bayes algorithm designed to improve classification performance on imbalanced datasets, particularly in text classification. It estimates each class's parameters from the data of all other classes (the complement of that class), which reduces bias towards majority classes and often makes it more suitable than standard Multinomial Naive Bayes.
An imbalanced dataset is one in which one class appears far more often than the others. This often happens in spam filtering (more normal emails than spam) or medical diagnosis (more healthy cases than disease cases).
Example:
If 95% of cases are "not fraud" and only 5% are "fraud," a model that always predicts "not fraud" will be 95% accurate but will miss all fraud cases. This shows why special methods are needed to deal with such uneven data.
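The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the label lists are synthetic and only mirror the 95/5 split described above, and the variable names are illustrative.

```python
# Synthetic labels mirroring the 95% / 5% split (1 = fraud, 0 = not fraud).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: fraction of fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")          # accuracy = 0.95
print(f"fraud recall = {fraud_recall:.2f}")  # fraud recall = 0.00
```

Accuracy looks high while recall on the minority class is zero, which is why accuracy alone is a poor guide on imbalanced data.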
For a class C and feature F:
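The estimate usually given for CNB (for example in the scikit-learn documentation) can be written as follows. The notation here is an assumption made for this sketch: $\operatorname{count}(F, \bar{C})$ is the total count of feature $F$ in all training samples that do not belong to class $C$, $|V|$ is the number of features, $\alpha$ is the smoothing parameter, and $t_F$ is the count of feature $F$ in the sample being classified.

$$
\hat{\theta}_{C,F} = \frac{\operatorname{count}(F, \bar{C}) + \alpha}{\sum_{F'} \operatorname{count}(F', \bar{C}) + \alpha\,|V|}
$$

$$
\hat{c} = \arg\min_{C} \sum_{F} t_F \,\log \hat{\theta}_{C,F}
$$

Because $\hat{\theta}_{C,F}$ measures how well a feature fits the complement of $C$, the predicted class is the one whose complement fits the sample worst. Minimizing the sum of logs is equivalent to minimizing the product $\prod_F \hat{\theta}_{C,F}^{\,t_F}$.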
Suppose we classify sentences as Apples or Bananas using word frequencies over the vocabulary {Round, Red, Soft}.
Solving by CNB: we classify a new sentence with features {Round = 1, Red = 1, Soft = 1}.
Step 1: Complement counts
Step 2: Probabilities (using Laplace smoothing, α =1)
For Apples:
For Bananas:
Step 3: Scores (multiply the feature probabilities for each class):
Final Result -> Bananas
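The three steps can be sketched in NumPy as follows. The training counts below are hypothetical, chosen only for illustration, so the intermediate numbers should not be read as the values of the worked example. The function name `complement_log_scores` is likewise an assumption of this sketch. Scores are computed in log space, which is equivalent to multiplying the probabilities.

```python
import numpy as np

# Hypothetical word counts per sentence (columns: Round, Red, Soft).
# Illustrative assumption only; not taken from the article.
X = np.array([
    [2, 1, 0],  # Apples
    [1, 1, 0],  # Apples
    [1, 1, 2],  # Bananas
    [0, 1, 1],  # Bananas
])
y = np.array(["Apples", "Apples", "Bananas", "Bananas"])
classes = ["Apples", "Bananas"]
x_new = np.array([1, 1, 1])  # Round=1, Red=1, Soft=1


def complement_log_scores(X, y, classes, x_new, alpha=1.0):
    """Return, per class, the log-score of x_new under that class's complement."""
    scores = {}
    for c in classes:
        # Step 1: feature counts over every sample NOT in class c.
        comp_counts = X[y != c].sum(axis=0)
        # Step 2: Laplace-smoothed complement probabilities.
        theta = (comp_counts + alpha) / (comp_counts.sum() + alpha * X.shape[1])
        # Step 3: log of the product of feature probabilities.
        scores[c] = float(x_new @ np.log(theta))
    return scores


scores = complement_log_scores(X, y, classes, x_new, alpha=1.0)
# CNB picks the class whose complement explains the sample worst (lowest score).
predicted = min(scores, key=scores.get)
print(scores)
print("Predicted:", predicted)
```

With these hypothetical counts, the complement of Bananas (the Apples sentences) rarely contains Soft, so Bananas receives the lower complement score and is selected.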
We can implement CNB using scikit-learn on the wine dataset (for demonstration purposes).
We will import the required libraries and load the dataset:
We will split the dataset into training and test sets:
We will train the Complement Naive Bayes classifier:
We will now evaluate the trained model:
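The steps above (import and load, split, train, evaluate) can be sketched end to end as follows. This is a minimal sketch, not necessarily the exact code the walkthrough used: the 70/30 split, `random_state=42`, and stratification are assumptions made here.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

# Load the wine dataset as a feature matrix and a label vector.
X, y = load_wine(return_X_y=True)

# Split into training and test sets (split ratio and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train the Complement Naive Bayes classifier (default alpha=1.0).
clf = ComplementNB()
clf.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

`ComplementNB` requires non-negative feature values, which the wine measurements satisfy; data with negative values would need to be rescaled first.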
Note: CNB is better suited for discrete data like text. For continuous features (as in this dataset), Gaussian Naive Bayes might perform better.
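One way to check this note on the same data is to compare the two classifiers with cross-validation. The sketch below assumes 5-fold cross-validation and default hyperparameters; the resulting scores depend on the scikit-learn version and fold assignment, so none are quoted here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB, GaussianNB

X, y = load_wine(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each classifier.
results = {}
for name, model in [("ComplementNB", ComplementNB()), ("GaussianNB", GaussianNB())]:
    fold_scores = cross_val_score(model, X, y, cv=5)
    results[name] = float(fold_scores.mean())
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```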
| Scenario | Why CNB is Suitable |
|---|---|
| Imbalanced class distributions | The complement approach ensures minority classes receive fairer parameter estimates. |
| Text classification | CNB handles discrete feature counts (e.g., word frequencies) very effectively. |
| Large feature spaces | CNB is computationally efficient and easy to interpret, even with many features. |