![]() |
VOOZH | about |
Categorical data refers to variables that belong to distinct categories such as labels, names or types. Since most machine learning algorithms require numerical inputs, encoding categorical data to numerical data becomes important. Proper encoding ensures that models can interpret categorical variables effectively, leading to improved predictive accuracy and reduced bias.
1. Nominal Data:Nominal data consists of categories without any inherent order or ranking. These are simple labels used to classify data.
2. Ordinal Data: Ordinal data includes categories with a defined order or ranking, where the relationship between values is important.
Using the right encoding techniques, we can effectively transform categorical data for machine learning models which improves their performance and predictive capabilities.
Label Encoding assigns each category a unique integer. It is simple and memory-efficient but may unintentionally imply an order among categories when none exists.
Let's look at the following example:
Output:
Encoded Data: [0 1 2 0]
Here, 'Red' becomes 0, 'Green' becomes 1 and 'Blue' becomes 2.
One-Hot Encoding converts categories into binary columns with each column representing one category. It prevents false ordering but can lead to high dimensionality if there are many unique values.
Let's look at the following example:
Output:
Each unique category ('Red', 'Blue', 'Green') is transformed into a separate binary column, with 1 representing the presence of the category and 0 its absence.
Ordinal Encoding maps categories to integers while preserving their natural order. This works well for ordered data like ratings but is not suitable for nominal variables.
Let's consider the following example:
Output:
In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the natural order of the categories.
Target Encoding also known as Mean Encoding is a technique where each category in a feature is replaced by the mean of the target variable for that category.
Let's consider the following example:
Output:
In this case, each color is encoded based on the mean of the target variable. For instance, 'Red' has a mean target value of approximately 0.485, which reflects the target values for the rows where 'Red' appears.
Binary encoding represents categories as binary codes and splits them across multiple columns. It is efficient for high-cardinality data but slightly more complex to implement.
Let's consider the following example:
Output:
Here, each category (like 'Red', 'Blue', 'Green') is converted into binary digits. 'Red' gets the binary code '10', 'Blue' becomes '01' and 'Green' becomes '11'. Each binary digit is placed in a separate column (e.g., Color_0 and Color_1).
Frequency Encoding assigns categories values based on how often they occur in the dataset. It is simple and compact but can introduce data leakage if applied improperly.
Let's consider the following example:
Output:
Encoded Data: [np.int64(3), np.int64(1), np.int64(1), np.int64(3), np.int64(3)]
Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue' appear once, so they are encoded as 1.
| Technique | Suitable For | Dimensionality | Overfitting Risk | Interpretability |
|---|---|---|---|---|
| One-Hot Encoding | Nominal | High | Low | High |
| Label Encoding | Ordinal (sometimes Nominal) | Low | Medium | Medium |
| Ordinal Encoding | Ordinal | Low | Medium | High |
| Binary Encoding | High-cardinality features | Medium | Medium | Medium |
| Frequency Encoding | High-cardinality | Low | High | Medium |
| Target Encoding | High-cardinality | Low | High | Low-Medium |