Feature Scaling and Normalisation in a Nutshell
Why, How and When to re-scale your features
One of the most fundamental steps in machine learning is feature engineering, in which we try to craft features that are as predictive as possible. Once we get there, we usually end up with a bunch of features of significantly different natures. So what effect does this irregularity have on model performance, and how can we deal with it?
Feature Engineering is the process of creating predictive features that can potentially help Machine Learning models achieve a desired performance. In most cases, features will be measurements with different units and ranges of values. For instance, you might consider adding to your feature space the age of your employees, which could theoretically take values between 1 and 100, and also their compensation, which could range from a few thousand to a few million.
In this article, I am going to introduce Feature Scaling, a pre-processing technique that handles cases where our ML models require scaled features for optimal results.
What is wrong with having features of different scale and range of values in the dataset?
Having features varying in scale and range could be an issue when the model we are trying to build uses distance measures such as Euclidean Distance. Such models could be K-Nearest Neighbours, K-Means Clustering, Learning Vector Quantization (LVQ) etc.
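To make the problem concrete, here is a minimal sketch with three made-up employees (the numbers are hypothetical and chosen only for illustration). It compares Euclidean distances before and after standardising each column:

```python
import numpy as np

# Hypothetical employees: [age in years, compensation in currency units]
employees = np.array([
    [25.0, 50_000.0],  # A
    [60.0, 50_500.0],  # B: very different age, similar compensation
    [26.0, 58_000.0],  # C: similar age, different compensation
])


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On the raw values, the compensation column dominates the distance
d_ab_raw = euclidean(employees[0], employees[1])
d_ac_raw = euclidean(employees[0], employees[2])

# Standardise each column (zero mean, unit standard deviation)
scaled = (employees - employees.mean(axis=0)) / employees.std(axis=0)
d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])

print(f"raw:    d(A,B)={d_ab_raw:.1f}  d(A,C)={d_ac_raw:.1f}")
print(f"scaled: d(A,B)={d_ab_scaled:.2f}  d(A,C)={d_ac_scaled:.2f}")
```

On the raw values, the 35-year age gap between A and B barely registers next to the compensation differences. After standardisation, both features contribute on a comparable scale.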
Principal Component Analysis (PCA) is also a good example of when feature scaling is important: since we are interested in the components that maximise the variance, we need to ensure that we are comparing apples to apples.
Furthermore, feature scaling can also help models that use Gradient Descent as their optimisation algorithm – since feature standardisation helps reach convergence much faster.
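On the other hand, feature scaling is not required (and has practically no effect when applied) for models that are neither distance-based nor trained with gradient descent. These include tree-based models such as Decision Trees and Random Forests, which split on one feature at a time using thresholds, so rescaling a feature only rescales the thresholds.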
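A quick way to convince yourself of this is to train the same decision tree on a feature expressed in two different units. The sketch below uses synthetic, hypothetical data (age, and compensation in thousands versus in single units) and is only meant as an illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: age in years, compensation in thousands
age = rng.integers(20, 65, size=200)
pay_k = rng.integers(20, 200, size=200)
X = np.column_stack([age, pay_k]).astype(float)
# Made-up target that depends on both features
y = ((age > 40) & (pay_k > 90)).astype(int)

# Same data, but compensation expressed in single units instead of thousands
X_rescaled = X * np.array([1.0, 1000.0])

# Train one tree per representation on the first 150 rows
tree_raw = DecisionTreeClassifier(random_state=0).fit(X[:150], y[:150])
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_rescaled[:150], y[:150])

# Compare predictions on the remaining 50 rows
pred_raw = tree_raw.predict(X[150:])
pred_rescaled = tree_rescaled.predict(X_rescaled[150:])
same_predictions = bool(np.array_equal(pred_raw, pred_rescaled))
print(f"Identical predictions: {same_predictions}")
```

Changing the unit of a feature moves the split thresholds but not the way the samples are partitioned, which is why the two trees agree.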
What is Feature Scaling?
Feature scaling is the process of scaling the values of features in a dataset so that they proportionally contribute to the distance calculation. The two most commonly used feature scaling techniques are Standardisation (or Z-Score Normalisation) and Min-Max scaling.
Standardisation (also known as Z-Score Normalisation) is the process of rescaling a feature x so that it has mean μ=0 and standard deviation σ=1. Technically, standardisation centres and normalises the data by subtracting the mean and dividing by the standard deviation. The resulting values are called standard scores (or z-scores) and can be computed as follows:
z = (x - μ) / σ
where μ and σ are the mean and standard deviation of the feature before rescaling.
In Python and scikit-learn this would translate to:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
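For intuition, the same computation can be sketched by hand with NumPy. The toy arrays below are hypothetical; the point is that the mean and standard deviation come from the training data and are then reused for the testing data:

```python
import numpy as np

# Hypothetical toy data: columns are [age, compensation]
train_X = np.array([
    [25.0, 40_000.0],
    [35.0, 60_000.0],
    [45.0, 80_000.0],
    [55.0, 120_000.0],
])
test_X = np.array([
    [30.0, 50_000.0],
    [50.0, 200_000.0],
])

# Statistics are learned from the training data only
mu = train_X.mean(axis=0)
sigma = train_X.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied to both sets with the training statistics
train_z = (train_X - mu) / sigma
test_z = (test_X - mu) / sigma

print(train_z.mean(axis=0), train_z.std(axis=0))
```

The standardised training columns have zero mean and unit standard deviation. The testing columns generally do not, and that is expected since they are scaled with the training statistics.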
Min-Max Scaling is the process of rescaling feature values into a particular range (for example [0, 1]). The formula for scaling a value x into a range [a, b] is given below:
x' = a + (x - min(x)) (b - a) / (max(x) - min(x))
In Python and scikit-learn this would translate to:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
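The formula for an arbitrary range [a, b] can be sketched by hand as follows. The helper `min_max_scale` and the toy arrays are hypothetical, and the sketch assumes no column is constant (otherwise the denominator would be zero). In scikit-learn, the target range is controlled through the `feature_range` parameter of `MinMaxScaler`.

```python
import numpy as np


def min_max_scale(train, other, a=0.0, b=1.0):
    """Rescale the columns of `other` into [a, b] using min/max learned from `train`."""
    x_min = train.min(axis=0)
    x_max = train.max(axis=0)
    return a + (other - x_min) * (b - a) / (x_max - x_min)


# Hypothetical toy data: columns are [age, compensation]
train_X = np.array([[25.0, 40_000.0], [35.0, 60_000.0], [55.0, 120_000.0]])
test_X = np.array([[30.0, 200_000.0]])

train_01 = min_max_scale(train_X, train_X)             # into [0, 1]
train_pm1 = min_max_scale(train_X, train_X, -1.0, 1.0)  # into [-1, 1]
test_01 = min_max_scale(train_X, test_X)               # reuse training min/max

print(train_01)
print(test_01)
```

Note that a testing value outside the range seen during training (the compensation of 200,000 here) lands outside [a, b]. That is a property of the method rather than a bug.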
At what modelling step do we apply feature scaling?
It is important to mention that before applying any sort of data normalisation, we first need to split our initial dataset into training and testing sets. Don’t forget that testing data points represent real-world data.
As mentioned earlier, the mean and standard deviation are taken into account when standardising our data. If we compute them over the whole dataset, we leak information from the testing set into the training explanatory variables. Therefore, we should fit the scaler on the training data only and then scale the testing instances as well, but this time using the mean and standard deviation of the training explanatory variables. In this way, we can test and evaluate whether our model can generalise well to new, unseen data points.
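The same rule applies when cross-validating: the scaler should be re-fitted on the training portion of every fold. One common way to get this for free is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below uses scikit-learn's bundled wine dataset and a K-Nearest Neighbours classifier purely as an example; the choice of model and `n_neighbors=5` are my assumptions, not part of the walkthrough that follows.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The scaler is part of the estimator, so in every fold it is fitted
# on the training portion only and then applied to the held-out portion
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```

Scaling the full dataset first and cross-validating afterwards would let each fold's held-out data influence the scaling statistics.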
Exploring the impact of Feature Scaling on the Wine Recognition Dataset
Now let’s assume that we want to perform Principal Component Analysis (PCA) over the UCI ML Wine recognition dataset.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The features are Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines and Proline. The goal is to predict the cultivar, which is one of class_0, class_1 and class_2.
For the sake of this example, we are going to skip the feature scaling step at first and observe the results when no pre-processing step is taken. Then, we will repeat the same procedure but this time using feature scaling and finally compare the results.
Step 1: Load the data
We load the data and separate our features from their respective target variables:
from sklearn.datasets import load_wine
features, target = load_wine(return_X_y=True)
Step 2: Split initial dataset into training and testing sets
As mentioned earlier, before taking any pre-processing step we first need to split our dataset into training and testing sets. The former will be used for model training and the latter for evaluating the performance of the trained model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, target, test_size=0.3, random_state=42
)
Step 3: Scale the data
Now we need to scale the data. We fit the scaler on the training set only, and then transform both the training and testing sets using the parameters learned from the training examples.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Step 4: Apply Dimensionality Reduction using PCA
Now we can perform Principal Component Analysis. For the sake of ease, I am going to use two components so that it’s easier to visualise the results later on in a two-dimensional space:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_dim_red = pca.fit_transform(X_train_scaled)
X_test_dim_red = pca.transform(X_test_scaled)
Now we can quickly visualise the training instances after scaling and performing dimensionality reduction (the link to the code is given below):
Below, the same plot is generated but this time no feature scaling was applied:
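A numeric way to see the difference between the two cases is to look at how much variance each principal component explains, with and without scaling. This is a small sketch of my own rather than part of the original code; it reuses the same split parameters as above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features, target = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

# PCA fitted directly on the raw training features
raw_ratio = PCA(n_components=2).fit(X_train).explained_variance_ratio_

# PCA fitted on the standardised training features
X_train_scaled = StandardScaler().fit_transform(X_train)
scaled_ratio = PCA(n_components=2).fit(X_train_scaled).explained_variance_ratio_

print("Without scaling:", raw_ratio)
print("With scaling:   ", scaled_ratio)
```

Without scaling, the first component captures almost all of the variance because it is dominated by the feature with the largest numeric range (Proline). After scaling, the variance is spread across components in a far more balanced way.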
Step 5: Train and evaluate a model
Finally, we can fit a Gaussian Naive Bayes model and evaluate the performance of the model on testing instances:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train_dim_red, y_train)
predictions = model.predict(X_test_dim_red)
print(f'Model Accuracy: {accuracy_score(y_test, predictions):.2f}')
>>> Model Accuracy: 0.98
The model accuracy hits 98% on testing instances. If no scaling is applied, the test accuracy drops to 81%.
The full code is available on GitHub as a Gist.
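To reproduce the comparison in one go, both variants can be wrapped in a small helper. The function `evaluate` below is a hypothetical convenience of mine; it chains the same steps used above (optional scaling, PCA with two components, Gaussian Naive Bayes) in a scikit-learn pipeline:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)


def evaluate(scale):
    """Fit PCA + Gaussian Naive Bayes, optionally preceded by standardisation."""
    steps = [PCA(n_components=2), GaussianNB()]
    if scale:
        steps.insert(0, StandardScaler())
    model = make_pipeline(*steps)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


scaled_acc = evaluate(scale=True)
raw_acc = evaluate(scale=False)
print(f"With scaling:    {scaled_acc:.2f}")
print(f"Without scaling: {raw_acc:.2f}")
```

Exact figures may vary slightly between library versions, but the gap between the two runs should be clearly visible.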
Conclusion
Feature scaling is one of the most fundamental pre-processing steps that we need to consider before training machine learning models. As we already discussed, we first need to understand whether feature scaling is required. This depends on the model we aim to build (for example, tree-based models don’t require any sort of feature scaling) and on the nature of our feature values.