URL: https://www.geeksforgeeks.org/machine-learning/handling-categorical-data-in-python/

Handling Categorical Data in Python

Last Updated : 14 Oct, 2025

Categorical data refers to features whose values come from a fixed set of categories. Handling it correctly matters because mistakes here lead to inaccurate analysis and poor model performance. This article walks through cleaning, binning, visualizing and encoding categorical data in Python.

Why Do We Need to Handle Categorical Data?

Handling categorical data is important because:

  1. Algorithms Require Numerical Inputs: Most machine learning algorithms cannot directly process categorical data and need it to be converted into numerical formats.
  2. Inconsistent Categories: Categorical data often contains inconsistencies such as typos, mixed letter case or alternate spellings. These must be standardized so that they are not treated as separate categories.
  3. Remapping Categories: Some categories might need to be grouped for simplicity and relevance. For example, remapping rare categories into an "Other" group.
  4. Improves Model Performance: Choosing an encoding that matches the data, such as one-hot encoding for unordered categories or ordinal encoding for ordered ones, lets a model use the categories without inventing relationships between them, which leads to better predictions.
  5. Handles Real-World Complexity: Categorical features appear in many domains such as e-commerce, finance and healthcare, so robust handling is needed for real-world datasets.

Implementation for Handling Categorical Data

Here we will use a Demographics dataset that contains some incorrect, invalid or meaningless values (bogus values) caused by human error while filling in the survey form, or by other causes. You can download the dataset from here.

Step 1: Importing necessary Libraries

We will use the NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn libraries for this implementation.

Step 2: Loading the Dataset

We load the dataset into a Pandas DataFrame for manipulation.
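The original listing for Steps 1 and 2 is not reproduced here, so the following is only a minimal sketch. Because the file name of the dataset is not given, it reads from an in-memory CSV string instead; the column names (first_name, blood_type, marriage_status, income) and all rows are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for the demographics CSV file: the real file name,
# columns and rows are not given in the text, so these are made up.
csv_text = """first_name,blood_type,marriage_status,income
Asha,A+,married,145000
Ben,C+,MARRIED,72000
Chen,O-, married,98000
Dia,AB+,unmarried ,40000
Eli,D-,divorced,190000
"""

# pd.read_csv accepts a file path or any file-like object.
main_data = pd.read_csv(io.StringIO(csv_text))

print(main_data.head())  # first five rows
print(main_data.shape)   # (rows, columns)
```

With a real file, the same call would take the path to the CSV instead of the StringIO object.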

Output:

[Figure: First five rows of the dataset]

Step 3: Identifying and Removing Bogus Blood Types

First we create a DataFrame containing all valid blood types to check for bogus values in the dataset:

Output:

[Figure: DataFrame of valid blood types]

Let's find the bogus blood types by comparing the dataset values to this valid list:
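One way to sketch this comparison is with a set difference. The column name blood_type and the sample rows are assumptions made for illustration; only the eight valid blood types are taken as given.

```python
import pandas as pd

# Illustrative survey column (assumed name: blood_type) with two invalid entries.
main_data = pd.DataFrame({"blood_type": ["A+", "C+", "O-", "AB+", "D-", "B-"]})

# Reference table holding the eight valid blood types.
blood_type_categories = pd.DataFrame(
    {"blood_type": ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]}
)

# Anything present in the data but absent from the reference set is bogus.
valid_blood_types = set(blood_type_categories["blood_type"])
bogus_blood_types = set(main_data["blood_type"]) - valid_blood_types
print(bogus_blood_types)
```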

Output:

{'C+', 'D-'}

Once the bogus values are found the corresponding rows can be dropped from the dataset.
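Dropping those rows could look like the sketch below, which filters with a boolean mask from isin. The DataFrame contents and the column names are illustrative assumptions.

```python
import pandas as pd

# Illustrative data; column names and values are assumptions.
main_data = pd.DataFrame(
    {
        "blood_type": ["A+", "C+", "O-", "AB+", "D-", "B-"],
        "income": [145000, 72000, 98000, 40000, 190000, 110000],
    }
)
valid_blood_types = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}

# True for rows whose blood type is in the valid set.
is_valid = main_data["blood_type"].isin(valid_blood_types)

# Keep only valid rows; .copy() avoids chained-assignment warnings later on.
without_bogus_records = main_data[is_valid].copy()
print(without_bogus_records["blood_type"].unique())
```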

Output:

array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)

Step 4: Handling Inconsistent Marriage Status Categories

Checking the unique values in the marriage_status column:

Output:

array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced', 'unmarried', 'UNMARRIED', 'separated'], dtype=object)

First, standardize the categories by converting all text to lowercase:

Output:

array(['married', ' married', 'unmarried ', 'divorced', 'unmarried', 'separated'], dtype=object)

Next, strip the leading and trailing spaces:
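Lowercasing and whitespace stripping can be chained with the pandas string accessor. The sketch below uses the eight raw marriage_status values listed earlier in this step; the DataFrame itself is a stand-in for the real dataset.

```python
import pandas as pd

# The raw category values shown earlier in this step.
df = pd.DataFrame(
    {
        "marriage_status": [
            "married", "MARRIED", " married", "unmarried ",
            "divorced", "unmarried", "UNMARRIED", "separated",
        ]
    }
)

# Lowercase first, then remove leading/trailing whitespace.
df["marriage_status"] = df["marriage_status"].str.lower().str.strip()
print(df["marriage_status"].unique())
```

The order of the two calls does not matter here, since neither affects what the other changes.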

Output:

array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)

Step 5: Grouping Income into Meaningful Bins

Numerical data such as age or income can be mapped to groups. Let us check the income range to define the bin intervals:

Output:

Max income - 190000, Min income - 40000

Now, let us create the bin ranges and labels for the income feature. The Pandas cut method (pd.cut) is used here.
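A minimal sketch of the binning follows. The sample incomes and the column names income and income_groups are assumptions, and the bin edges are inferred from the group labels used later in the article. Note that pd.cut uses right-closed intervals by default, so include_lowest=True is passed here to keep the minimum income (40000) in the first bin rather than turning it into NaN.

```python
import numpy as np
import pandas as pd

# Illustrative incomes spanning the reported range (40000 to 190000).
df = pd.DataFrame({"income": [40000, 72000, 98000, 110000, 145000, 190000]})
print(f"Max income - {df['income'].max()}, Min income - {df['income'].min()}")

# Bin edges inferred from the labels; np.inf leaves the last bin open-ended.
bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ["40K-75K", "75K-100K", "100K-125K", "125K-150K", "150K+"]

# include_lowest=True makes the first interval include its left edge.
df["income_groups"] = pd.cut(
    df["income"], bins=bins, labels=labels, include_lowest=True
)
print(df)
```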

Output:

[Figure: First five rows of the dataset]

Step 6: Visualizing Income Group Distribution

Now let's visualize the distribution of income groups:
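The sketch below draws such a distribution as a bar chart using Matplotlib only; Seaborn's countplot would be an alternative. The sample values are made up, and the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

labels = ["40K-75K", "75K-100K", "100K-125K", "125K-150K", "150K+"]

# Illustrative income groups stored as an ordered categorical.
income_groups = pd.Series(
    pd.Categorical(
        ["40K-75K", "75K-100K", "40K-75K", "150K+",
         "100K-125K", "40K-75K", "125K-150K", "75K-100K"],
        categories=labels,
        ordered=True,
    )
)

# Count rows per group and keep the bars in category order.
counts = income_groups.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("Income group")
ax.set_ylabel("Number of people")
ax.set_title("Distribution of income groups")
# plt.show()  # uncomment in an interactive session
```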

Output:

[Figure: Distribution of income groups]

Step 7: Cleaning Phone Number Data

Simulating phone numbers with inconsistent formats and cleaning them:

Output:

[Figure: Simulated phone numbers]

Depending on the use case, the country code can either be dropped from the numbers that have it or added to the numbers that lack it. Phone numbers with fewer than 10 digits should be discarded.
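As a rough sketch of this cleanup, the example below drops a country-code prefix and then discards short numbers. The "+91 " prefix, the column name phone_number and the sample numbers are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical phone numbers; some carry an assumed "+91 " country code
# and some are too short to be valid.
df = pd.DataFrame(
    {
        "phone_number": [
            "+91 9876543210", "9123456780", "+91 12345",
            "987654", "+91 9988776655",
        ]
    }
)

# Remove the country-code prefix; regex=False treats "+91 " literally.
df["phone_number"] = df["phone_number"].str.replace("+91 ", "", regex=False)

# After the prefix is gone every entry holds digits only, so the string
# length equals the digit count. Keep entries with at least 10 digits.
has_enough_digits = df["phone_number"].str.len() >= 10
df = df[has_enough_digits].reset_index(drop=True)
print(df)
```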

Output:

[Figure: Dataset after invalid phone numbers are discarded]

Finally, we can verify that the phone number data is clean.
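Such a check can be written as a pair of assertions, sketched below on a small hypothetical cleaned column. The "+91 " prefix is again an assumption.

```python
import pandas as pd

# Hypothetical cleaned column used to demonstrate the checks.
cleaned = pd.Series(
    ["9876543210", "9123456780", "9988776655"], name="phone_number"
)

# No entry should still carry the assumed "+91 " prefix.
assert not cleaned.str.contains("+91 ", regex=False).any()

# Every entry should consist of exactly ten digits.
assert cleaned.str.fullmatch(r"\d{10}").all()

print("phone_number column is clean")
```

If either assertion fails, Python raises an AssertionError, which signals that more cleaning is needed.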

Step 8: Visualizing Categorical Data

Several kinds of plots can be used to visualize categorical data and gain more insight into it. Let us visualize the number of people belonging to each blood type.

Output:

[Figure: Number of people per blood type]

Now we can look at the relationship between income and marital status using a box plot.
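A box plot of this kind can be sketched with Matplotlib alone, as below; Seaborn's boxplot is an alternative. The data and the column names marriage_status and income are illustrative assumptions, and the Agg backend is used only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data; values are made up.
df = pd.DataFrame(
    {
        "marriage_status": ["married", "unmarried", "married", "divorced",
                            "unmarried", "separated", "divorced", "separated"],
        "income": [145000, 72000, 98000, 40000, 110000, 65000, 190000, 83000],
    }
)

# Collect one list of incomes per marital status (index is sorted by status).
grouped = df.groupby("marriage_status")["income"].apply(list)

fig, ax = plt.subplots()
ax.boxplot(grouped.tolist())  # one box per status, at positions 1..n
ax.set_xticks(range(1, len(grouped) + 1))
ax.set_xticklabels(list(grouped.index))
ax.set_xlabel("Marital status")
ax.set_ylabel("Income")
# plt.show()  # uncomment in an interactive session
```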

Output:

[Figure: Box plot of income by marital status]

Step 9: Encoding Categorical Data

Many learning algorithms, such as regression models and neural networks, require numeric input. Categorical data must therefore be converted to numbers before these algorithms can use it. Let us look at some encoding methods.

1. Label Encoding

With label encoding we can number the categories from 0 to num_categories - 1. Let us apply label encoding on the blood type feature.
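A minimal sketch using scikit-learn's LabelEncoder is shown below; the sample blood types and the column names are assumptions. LabelEncoder assigns integers in sorted order of the category strings, and scikit-learn documents it as intended for target labels, so treat its use on an input feature here as an illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative blood type column.
df = pd.DataFrame({"blood_type": ["A+", "O-", "AB+", "B-", "A+", "O+"]})

encoder = LabelEncoder()
# fit_transform learns the sorted unique categories and maps each to 0..n-1.
df["blood_type_encoded"] = encoder.fit_transform(df["blood_type"])

# Show which integer was assigned to which category.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(df)
```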

Output:

[Figure: Label Encoding]

2. One-hot Encoding in Python

Label encoding has several limitations that one-hot encoding avoids by giving each category its own binary column:

  • Creates a false order: It gives numbers like 0, 1, 2 to categories which may make models think one category is bigger or better than the other.
  • Misleads models: Linear models such as linear regression treat the labels as magnitudes, and decision trees split on their arbitrary order, which can reduce accuracy.
  • Problem with distance-based models: In models like KNN or K-Means, the numeric labels can wrongly influence distance calculations.
  • Bias in training: Some models may give more importance to higher label values, even if all categories are equal.
  • Not suitable for nominal data: Label encoding is not a good choice when categories have no natural order, like colors or city names.
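One-hot encoding can be sketched with pandas get_dummies, which creates one indicator column per category. The sample data and the blood_type column name are assumptions; scikit-learn's OneHotEncoder is an alternative when the encoding has to be reused inside a pipeline.

```python
import pandas as pd

# Illustrative blood type column.
df = pd.DataFrame({"blood_type": ["A+", "O-", "AB+", "A+"]})

# One 0/1 column per category; dtype=int gives integers instead of booleans.
one_hot = pd.get_dummies(df["blood_type"], prefix="blood_type", dtype=int)

# Attach the indicator columns to the original data.
encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

Each row has exactly one 1 among the indicator columns, so no order between categories is implied.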

Output:

[Figure: One-hot Encoding]

3. Ordinal Encoding in Python

Categorical data can be ordinal, meaning the order of the categories matters. For such features we want the encoding to preserve that order. We will perform ordinal encoding on the income groups, preserving the order 40K-75K < 75K-100K < 100K-125K < 125K-150K < 150K+.
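One way to sketch this is with scikit-learn's OrdinalEncoder, passing the categories explicitly so that the integer codes follow the intended order rather than alphabetical order. The sample rows and the column names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Intended order, from lowest to highest income group.
order = ["40K-75K", "75K-100K", "100K-125K", "125K-150K", "150K+"]

# Illustrative income group column.
df = pd.DataFrame(
    {"income_groups": ["75K-100K", "40K-75K", "150K+", "100K-125K", "125K-150K"]}
)

# categories takes one list per encoded column; codes follow list positions.
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(df[["income_groups"]])  # 2D float array

df["income_groups_encoded"] = codes.ravel().astype(int)
print(df)
```

A plain dictionary passed to Series.map would give the same result for a fixed, known set of groups.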

Output:

[Figure: Ordinal Encoding]

With these techniques we can prepare categorical data for meaningful analysis and effective machine learning models.
