Better encoding generally leads to a better model, and most machine learning algorithms cannot handle categorical variables unless they are converted to numerical values.
Categorical features are generally divided into 3 types:
A. Binary: Either/or
Examples:
B. Ordinal: Specific ordered Groups.
Examples:
C. Nominal: Unordered groups.
Examples:
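To make the three types concrete, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and are not taken from the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset.
df = pd.DataFrame({
    "subscribed": ["yes", "no", "yes"],    # binary: either/or
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
    "city": ["Paris", "Tokyo", "Lima"],    # nominal: no natural order
})

# An ordered categorical dtype records the ranking of an ordinal feature explicitly.
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)
print(df["size"].min(), df["size"].max())
```

Only the ordinal column has a meaningful minimum and maximum; asking the same question of the nominal column would make no sense.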
Dataset: To download the file click on the link encoding dataset
Example:
Output:
Let's examine the columns of the dataset with different types of encoding techniques.
Code: Mapping binary features present in the dataset.
Output:
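A minimal sketch of binary mapping with pandas follows. The column names `bin_1` and `bin_2` and their values are assumptions made for illustration, since the dataset itself is not shown here.

```python
import pandas as pd

# Hypothetical column names and values; the real dataset may differ.
df = pd.DataFrame({"bin_1": ["T", "F", "T"], "bin_2": ["Y", "N", "N"]})

# Map each of the two possible values to 1 or 0.
df["bin_1"] = df["bin_1"].map({"T": 1, "F": 0})
df["bin_2"] = df["bin_2"].map({"Y": 1, "N": 0})
print(df)
```

Any value that is missing from the mapping dictionary becomes NaN, so it is worth checking the unique values of a column before mapping it.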
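Label Encoding: The label encoding algorithm is quite simple: it replaces each category with an integer. Because integers imply an order, it can be used for encoding ordinal data, provided the assigned integers match the real order of the categories.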
Code:
Output:
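As a minimal sketch, label encoding could be done with scikit-learn's `LabelEncoder`. The `grade` column is a hypothetical example and not part of the article's dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal-looking column.
df = pd.DataFrame({"grade": ["low", "high", "medium", "low"]})

le = LabelEncoder()
df["grade_encoded"] = le.fit_transform(df["grade"])

# classes_ lists the categories in the order of their integer codes.
print(list(le.classes_))
print(df)
```

`LabelEncoder` assigns codes in sorted order of the category names, so here "high" gets 0, "low" gets 1 and "medium" gets 2. That does not match the real ranking low < medium < high, so check the assigned codes before relying on them for ordinal features.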
One-Hot Encoding: Label encoding implies a hierarchy among the values of a column, which can be misleading for the nominal features present in the data. To overcome this disadvantage, we can use the one-hot encoding strategy.
One-hot encoding is processed in 2 steps:
Code: One-Hot encoding with Sklearn library
Output:
Code: One-Hot encoding with pandas
Output:
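The pandas route can be sketched with `pd.get_dummies`. Again, the `color` column is a hypothetical stand-in for a nominal feature.

```python
import pandas as pd

# Hypothetical nominal column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# columns= selects what to encode; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)
print(dummies)
```

The new columns are named after the original column and the category, for example `color_red`.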
This method is preferable because pandas names each new column after the original column and its category, which gives readable labels.
Note: One-hot encoding removes the implied order, but it can expand the number of columns greatly. For columns with many unique values, try other techniques.
Frequency Encoding: We can also encode categories by their frequency distribution, replacing each category with its count or its share of the rows. This method can be effective at times for nominal features.
Code:
Output:
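A minimal sketch of frequency encoding with pandas, using a hypothetical `city` column:

```python
import pandas as pd

# Hypothetical nominal column.
df = pd.DataFrame({"city": ["Paris", "Lima", "Paris", "Tokyo", "Paris", "Lima"]})

# Relative frequency of each category (share of rows).
freq = df["city"].value_counts(normalize=True)

# Replace each category with its frequency.
df["city_freq"] = df["city"].map(freq)
print(df)
```

Two categories that occur equally often receive the same code, so this encoding cannot tell them apart.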
Ordinal Encoding: We can use the OrdinalEncoder class provided by scikit-learn to encode ordinal features. When the category order is passed explicitly, it ensures that the ordinal nature of the variables is preserved.
Code: Using scikit-learn.
Output:
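A minimal sketch with scikit-learn's `OrdinalEncoder`, assuming a hypothetical `size` column whose order is small < medium < large:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the categories in rank order; otherwise they are sorted by name.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()
print(df)
```

The encoder returns floats, and the codes follow the order given in `categories`.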
One issue with this representation (ordinal encoding) is that an ML algorithm will assume that two nearby values are more similar than two distant ones.
Example of the above problem:
Output:
The model looks for the nearest encoded values, so it treats "red" and "green" as similar merely because their codes are adjacent.
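The effect can be sketched in plain Python. The integer codes below are an arbitrary, hypothetical assignment for a nominal colour feature.

```python
# Hypothetical integer codes for a nominal feature.
codes = {"red": 0, "green": 1, "blue": 2}

def distance(a, b):
    """Distance between two colours as a distance-based model would see it."""
    return abs(codes[a] - codes[b])

print(distance("red", "green"))
print(distance("red", "blue"))
```

A distance-based model such as nearest neighbours would conclude that red is closer to green than to blue, although the colours have no such relationship.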
Code: Manually assigning ranking by using a dictionary
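A minimal sketch of a manual ranking with a dictionary. The `temperature` column and its ranks are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal column.
df = pd.DataFrame({"temperature": ["cold", "hot", "warm", "cold"]})

# The dictionary states the ranking explicitly.
ranking = {"cold": 1, "warm": 2, "hot": 3}
df["temperature_rank"] = df["temperature"].map(ranking)
print(df)
```

Because the ranks are written out by hand, the encoded order cannot silently differ from the intended one.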
Binary Encoding: Categories are first encoded as integers, each integer is then converted into binary code, and the digits of that binary string are placed in separate columns.
For example, 7 becomes 1 1 1.
This method is preferable when there are many categories. Imagine you have 100 different categories: one-hot encoding creates 100 columns, but binary encoding needs only 7, since 2^7 = 128 ≥ 100.
Code:
Output:
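The idea can be sketched by hand with pandas, without a dedicated encoder class. The `city` column is hypothetical, and starting the integer codes at 1 is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical nominal column.
df = pd.DataFrame({"city": ["Paris", "Lima", "Tokyo", "Cairo", "Oslo"]})

# Step 1: integer codes, starting at 1 (categories are sorted by name).
codes = df["city"].astype("category").cat.codes + 1

# Step 2: number of binary digits needed for the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    df[f"city_{bit}"] = (codes.to_numpy() >> shift) & 1
print(df)
```

Five categories fit in three columns here, and the same calculation gives seven columns for 100 categories.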
Hash Encoding: Hashing is the process of converting a string of characters into a fixed-size hash value by applying a hash function. This approach is useful because it can handle a large number of categories with low memory usage, although different categories can collide on the same hash value.
Code:
Output:
You can then drop the original, now-encoded feature from your DataFrame.
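A minimal sketch of hash encoding using only `hashlib` and pandas. The bucket count `N_COMPONENTS`, the helper `hash_bucket` and the `city` column are all hypothetical names introduced for illustration; a dedicated encoder library may work differently.

```python
import hashlib
import pandas as pd

N_COMPONENTS = 8  # hypothetical number of output columns

def hash_bucket(value, n_components=N_COMPONENTS):
    """Map a category to a bucket index using a stable hash."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

# Hypothetical nominal column.
df = pd.DataFrame({"city": ["Paris", "Lima", "Tokyo", "Paris"]})

buckets = df["city"].map(hash_bucket)
for i in range(N_COMPONENTS):
    df[f"hash_{i}"] = (buckets == i).astype(int)

# Drop the original column once it has been encoded.
df = df.drop(columns=["city"])
print(df)
```

The number of columns stays fixed at `N_COMPONENTS` however many categories appear, at the cost that two categories may share a bucket.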
Mean/Target Encoding: Target encoding is useful because it produces values that help explain the target, and it is widely used in Kaggle competitions. The basic idea is to replace each categorical value with the mean of the target variable over the rows that share that value.
Code:
You can then drop the original, now-encoded feature from your DataFrame.
Output:
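A minimal sketch of mean/target encoding with pandas. The `city` and `target` columns are hypothetical toy data.

```python
import pandas as pd

# Hypothetical feature and binary target.
df = pd.DataFrame({
    "city": ["Paris", "Lima", "Paris", "Tokyo", "Lima", "Paris"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean of the target for each category.
means = df.groupby("city")["target"].mean()

# Replace each category with its target mean.
df["city_target_enc"] = df["city"].map(means)
print(df)
```

In practice the means are usually computed on the training data only and then applied to validation and test data, so that the target does not leak into the features.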