In machine learning, Gaussian Naive Bayes is a simple yet effective algorithm for classification tasks. It belongs to the Naive Bayes family of algorithms, which is built on Bayes' Theorem. This post explains the Gaussian Naive Bayes classifier and walks through an implementation in Python using the Scikit-Learn (Sklearn) library.
Naive Bayes classifiers are a family of probabilistic algorithms that apply Bayes' Theorem with the strong (naive) assumption that the features are conditionally independent of one another given the class label. Despite this simplifying assumption, Naive Bayes classifiers often perform well in a variety of real-world situations.
Gaussian Naive Bayes (GNB) is a probabilistic classification algorithm based on Bayes' Theorem. It assumes that, given the class label, the features are conditionally independent and each follows a Gaussian distribution, which makes GNB especially useful for continuous data. During training, the algorithm estimates the mean and variance of each feature for every class, along with the prior probability of each class. During prediction, it computes the probability of each class for an instance and assigns the class with the highest probability. GNB is computationally efficient and can handle high-dimensional datasets, and Naive Bayes classifiers in general are widely used in applications such as text classification and spam filtering.
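The training and prediction steps described above can be sketched from scratch with NumPy. This is a minimal illustration and not Scikit-Learn's implementation: the helper names `fit_gnb` and `predict_gnb`, the toy data, and the small constant added to the variances are all assumptions made for this example.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate class priors plus per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A tiny constant keeps every variance strictly positive (assumption of this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Return, for each row of X, the class with the highest log posterior."""
    # Broadcast to shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Conditional independence: sum the per-feature log likelihoods.
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gnb(X_toy, y_toy)
pred = predict_gnb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(pred)
```

Working in log space avoids numerical underflow when many small per-feature likelihoods are multiplied together.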
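Bayes' Theorem lets us update the probability of an event using evidence that has already been observed. The theorem is expressed mathematically as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where:
- P(A | B) is the posterior probability of A given that B has occurred.
- P(B | A) is the likelihood of B given A.
- P(A) is the prior probability of A.
- P(B) is the probability of the evidence B.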
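As a small worked example of Bayes' Theorem, the sketch below computes the probability that a message is spam given that it contains a particular word. All of the probabilities are hypothetical numbers chosen for illustration only.

```python
# Hypothetical inputs (not taken from any real dataset).
p_spam = 0.2              # prior: P(spam)
p_word_given_spam = 0.6   # likelihood: P(word | spam)
p_word_given_ham = 0.05   # likelihood: P(word | not spam)

# Evidence P(word) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(spam | word) from Bayes' Theorem.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))
```

With these numbers the evidence is 0.16 and the posterior is 0.12 / 0.16 = 0.75, so seeing the word raises the probability of spam from 20% to 75%.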
The Gaussian Naive Bayes classifier is one of several machine learning algorithms that can be applied to a wide range of classification problems. This article uses the well-known Scikit-Learn package (Sklearn) to walk readers who are new to data science and machine learning through the basic ideas of Gaussian Naive Bayes. We will go over the fundamental ideas, important vocabulary, and practical examples to help you grasp how the algorithm works.
Gaussian Naive Bayes (GNB) uses Gaussian (normal) distributions to represent the probability distribution of each feature within each class. For a dataset with m features and n classes, the model is represented by an estimated mean (μ) and variance (σ²) for every feature in every class, which gives m × n mean and variance pairs in total.
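Mathematically, the Gaussian distribution for a feature X_i in class C_j is represented as follows:

$$P(X_i = x \mid C_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^{2}}} \exp\left(-\frac{(x - \mu_{ij})^{2}}{2\sigma_{ij}^{2}}\right)$$

Where,
- μ_ij is the mean of feature X_i in class C_j.
- σ²_ij is the variance of feature X_i in class C_j.
- x is the observed value of feature X_i.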
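The Gaussian density for a single feature can be evaluated directly with NumPy. The function name `gaussian_likelihood` and the example inputs below are assumptions made for illustration.

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """Gaussian density of x for a feature with the given class mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Standard normal (mean 0, variance 1) evaluated at the mean and one standard deviation away.
peak = gaussian_likelihood(0.0, 0.0, 1.0)
one_sd = gaussian_likelihood(1.0, 0.0, 1.0)
print(peak, one_sd)
```

The density is highest at the class mean and falls off as the observed value moves away from it, which is what lets the classifier favor the class whose mean is closest relative to its spread.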
We’ll start by creating a synthetic dataset suitable for classification. The make_classification function in Sklearn will be used to create a dataset with two features.
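The original code for this step is not shown here, so the following is a minimal sketch of how such a dataset could be generated. The sample count, the `random_state`, and the choice of two informative features with no redundant ones are assumptions, not the article's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification

# Two features, both informative; n_redundant must be 0 when n_features is 2.
# n_samples and random_state are assumed values for this sketch.
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)

print(X.shape, y.shape)
print(np.bincount(y))  # number of samples in each class
```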
Now, we’ll train the Gaussian Naive Bayes model using the synthetic dataset.
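A minimal sketch of the training step follows. The dataset parameters, the 70/30 split, and the `random_state` values are assumptions, so the accuracy it prints will not necessarily match the figure reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Recreate a two-feature synthetic dataset (assumed parameters).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)

# Hold out 30% of the samples for evaluation (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the model: this estimates per-class priors, means, and variances.
model = GaussianNB()
model.fit(X_train, y_train)

# Evaluate on the held-out samples.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```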
Output:
Accuracy: 0.9666666666666667

The code performs Naive Bayes classification using scikit-learn and handles data using pandas. Labels are encoded, data is divided into training and testing sets, a Gaussian Naive Bayes classifier is trained, and the accuracy of the classifier is assessed.
We’ll start by loading the Census Income dataset from the UCI Machine Learning Repository.
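The loading code is not shown here, so this is a sketch of one way to read the data with pandas. The raw file has no header row, so the column names are supplied explicitly; the helper name `load_census` and the use of `skipinitialspace=True` are assumptions. To keep the example self-contained it reads two records (copied from the output shown in this article) from an in-memory string, and the same function can be pointed at a local copy or URL of the repository's `adult.data` file.

```python
import io
import pandas as pd

# Column names as they appear in the printed output of this article.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

def load_census(source):
    """Read Census Income data from a path, URL, or file-like object."""
    # skipinitialspace strips the blank that follows each comma in the raw file.
    return pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)

# Two sample records standing in for the full file.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
df = load_census(sample)
print(df.head())
```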
Output:
age workclass fnlwgt education education-num \
0 39 State-gov 77516 Bachelors 13
1 50 Self-emp-not-inc 83311 Bachelors 13
2 38 Private 215646 HS-grad 9
3 53 Private 234721 11th 7
4 28 Private 338409 Bachelors 13
marital-status occupation relationship race sex \
0 Never-married Adm-clerical Not-in-family White Male
1 Married-civ-spouse Exec-managerial Husband White Male
2 Divorced Handlers-cleaners Not-in-family White Male
3 Married-civ-spouse Handlers-cleaners Husband Black Male
4 Married-civ-spouse Prof-specialty Wife Black Female
capital-gain capital-loss hours-per-week native-country income
0 2174 0 40 United-States <=50K
1 0 0 13 United-States <=50K
2 0 0 40 United-States <=50K
3 0 0 40 United-States <=50K
4 0 0 40 Cuba <=50K
Before we can train our model, we need to preprocess the data. This includes converting categorical variables into numerical values and normalizing the continuous variables.
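The preprocessing described above can be sketched with `LabelEncoder` for the categorical columns and `MinMaxScaler` for the continuous ones. The helper name `preprocess` and the three-row sample frame are assumptions for illustration, and because the scaler is fitted on only three rows here, the scaled values differ from those obtained on the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def preprocess(df):
    """Label-encode non-numeric columns and min-max scale numeric ones."""
    out = df.copy()
    numeric = [c for c in out.columns if pd.api.types.is_numeric_dtype(out[c])]
    categorical = [c for c in out.columns if c not in numeric]

    # Map each category (including the target labels) to an integer code.
    for col in categorical:
        out[col] = LabelEncoder().fit_transform(out[col])

    # Rescale each continuous column to the [0, 1] range.
    scaled = MinMaxScaler().fit_transform(out[numeric])
    for i, col in enumerate(numeric):
        out[col] = scaled[:, i]
    return out

# Small sample with a subset of the Census Income columns.
raw = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})
processed = preprocess(raw)
print(processed)
```

Note that label encoding assigns arbitrary integers to categories, and Gaussian Naive Bayes then treats those integers as continuous values, which is a simplification worth keeping in mind when interpreting the results.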
Output:
age workclass fnlwgt education education-num marital-status \
0 0.301370 7 0.044302 9 0.800000 4
1 0.452055 6 0.048238 9 0.800000 2
2 0.287671 4 0.138113 11 0.533333 0
3 0.493151 4 0.151068 1 0.400000 2
4 0.150685 4 0.221488 9 0.800000 2
occupation relationship race sex capital-gain capital-loss \
0 1 1 4 1 0.02174 0.0
1 4 0 4 1 0.00000 0.0
2 6 1 4 1 0.00000 0.0
3 6 0 2 1 0.00000 0.0
4 10 5 2 0 0.00000 0.0
hours-per-week native-country income
0 0.397959 39 0
1 0.122449 39 0
2 0.397959 39 0
3 0.397959 39 0
4 0.397959 5 0
With our data preprocessed, we can now train the Gaussian Naive Bayes model.
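A sketch of the training step on a preprocessed frame is given below. The helper name `train_and_score`, the 80/20 split, and the `random_state` are assumptions. To stay self-contained, the demonstration uses a small randomly generated stand-in frame and not the real Census Income data, so its accuracy says nothing about the dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def train_and_score(df, target="income", test_size=0.2, random_state=42):
    """Split a numeric frame, fit GaussianNB, and return the model and test accuracy."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = GaussianNB().fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))

# Hypothetical stand-in for the preprocessed data: two scaled features and a binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.random(200), "hours-per-week": rng.random(200)})
demo["income"] = (demo["age"] + demo["hours-per-week"] > 1.0).astype(int)

model, acc = train_and_score(demo)
print(f"Accuracy: {acc:.3f}")
```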
Output:
Accuracy: 0.8086805200122837
This example shows how to apply Gaussian Naive Bayes to the Census Income dataset. By following these steps, you can predict income levels from employment and demographic characteristics.
In this article, we've introduced the Gaussian Naive Bayes classifier and demonstrated its implementation using Scikit-Learn. Understanding the basics of the algorithm and its key terminology, and following the steps provided, will help you apply Gaussian Naive Bayes to your own projects. As you continue your journey into machine learning, this knowledge will serve as a valuable foundation for more advanced concepts and techniques.