VOOZH about

URL: https://www.geeksforgeeks.org/machine-learning/ml-one-hot-encoding/

⇱ One Hot Encoding in Machine Learning - GeeksforGeeks


  • Courses
  • Tutorials
  • Interview Prep

One Hot Encoding in Machine Learning

Last Updated : 29 May, 2026

One-Hot Encoding is a data preprocessing technique used to convert categorical data into a numerical format that machine learning models can understand. It creates separate binary columns for each category, where 1 represents the presence of a category and 0 represents its absence.

  • Converts categorical values into binary columns
  • Prevents models from assuming an incorrect order between categories
  • Improves machine learning model performance
  • Helps capture relationships between categorical features
  • Required for many machine learning algorithms that accept numerical input only

Working of One-Hot Encoding

One-Hot Encoding creates a separate column for each category in the dataset. In the fruit example, when the fruit is Apple, the Fruit_Apple column gets the value 1 while the other fruit columns contain 0. Similarly, for Mango and Orange, their respective columns contain 1 and the remaining columns contain 0.

  • Each category gets its own binary column
  • 1 indicates the presence of a category
  • 0 indicates the absence of a category
  • Converts categorical values into numerical format for machine learning models
FruitCategorical value of fruitPrice
apple15
mango210
apple115
orange320

The output after applying one-hot encoding on the data is given as follows

Fruit_appleFruit_mangoFruit_orangeprice
1005
01010
10015
00120

Implementation

One-Hot Encoding can be implemented in Python using libraries such as Pandas and Scikit-learn, which provide simple and efficient methods for converting categorical data into binary columns.

1. Using Pandas

Pandas provides the get_dummies() function to perform one-hot encoding on categorical columns.

  • Converts categorical values into binary columns
  • Easy and efficient for preprocessing datasets
  • drop_first=True removes one redundant column to avoid multicollinearity
  • Example: Gender with values M and F becomes Gender_M and Gender_F columns

Output:

👁 output99
Output

2. Using Scikit Learn Library

Scikit-learn (sklearn) provides the OneHotEncoder function to convert categorical variables into binary columns for machine learning models.

  • Converts categorical data into binary format
  • Automatically identifies categories
  • Useful for machine learning preprocessing
  • select_dtypes() helps select categorical columns automatically

Output:

👁 output10
Output

Download full code from here

Advantages

  • Converts categorical data into numerical format for machine learning models
  • Helps improve model performance by representing categories clearly
  • Prevents incorrect ordinal relationships between categories

Limitations

  • Increases the number of columns in the dataset
  • Can create sparse datasets with many 0 values
  • May lead to overfitting when categories are too many and data is limited
Comment