What is Layer Normalization?

Last Updated : 8 Dec, 2025

Layer Normalization stabilizes and accelerates the training process in deep learning. In typical neural networks, activations of each layer can vary drastically which leads to issues like exploding or vanishing gradients which slow down training. Layer Normalization addresses this by normalizing the output of each layer which helps in ensuring that the activations stay within a stable range.

It works by normalizing the input to each neuron such that the mean activation becomes 0 and the variance becomes 1. Unlike Batch Normalization which normalizes over the batch i.e across all samples in the batch, it normalizes over the features for each individual data point.

Working of Layer Normalization

Let's consider an example where we have three vectors:

For each input of the layer, Layer Normalization computes the following:

1. Compute Mean and Variance for Each Feature

Mean and variance are calculated for each input but instead of across the batch, it’s done for the features (i.e per data point):

where, is the number of features (neurons) in the layer, is the input for each feature and and are the computed mean and variance.

Now, let's compute Mean and Variance for each feature (Per Data Point). For :

Mean ():
Variance ():

Similarly computer for and :

👁 Layer-Normalization

Layer Nomalization Example

2. Normalize the Input

Each feature is then normalized using the formula:

Here is a small constant added for numerical stability.

Now, we will normalize each feature in each vector by subtracting the mean and dividing by the standard deviation (square root of the variance) with a small constant added for numerical stability.

For :

We calculate each normalized value for :

For :

We calculate each normalized value for :

For :

We calculate each normalized value for :

3. Apply Scaling and Shifting

To ensure that the normalized activations can still represent a wide range of values, learnable parameters (scaling) and (shifting) are introduced. Final output is computed as:

This allows the network to scale and shift the normalized activations during training.

In real Layer Normalization, and are not scalars. They are vectors of size H (number of features). Each feature has its own learnable and :

For simplicity, the following example uses scalar and only to show the calculation steps. Here let’s assume and . We can apply this scaling and shifting to the normalized values to get the final output for each vector.

For :
For :
For :

These are the exact normalized values and the final outputs after applying Layer Normalization.

Implementation of Layer Normalization in a Simple Neural Network with PyTorch

We will be using Pytorch library for its implementation.

nn.Linear(input_size, output_size): Creates a fully connected layer with the specified input and output dimensions.
nn.LayerNorm(128): Applies Layer Normalization on the input of size 128.
forward(self, x): Defines forward pass for the model by applying transformations to the input x step by step.
torch.randn(10, 64): Generates a tensor of size (10, 64) filled with random values from a normal distribution.
torch.relu(x): Applies ReLU (Rectified Linear Unit) activation function element-wise to x.

Output:

👁 Capture

Res

Advantages of Layer Normalization

Works with Small Batches: It is not dependent on batch size which make it ideal for cases with small batches or when working with reinforcement learning where each input is processed individually.
Stabilizes Learning: It normalizes activations within a layer helping to prevent issues like exploding and vanishing gradients which ensures smoother and more efficient training.
Independent of Batch Size: Unlike Batch Normalization, it is calculated for each individual data point helps in making it more flexible and suitable for situations with variable batch sizes.
Works Well with RNNs: It is useful in Recurrent Neural Networks to maintain stable gradient flow which helps in enhancing performance in sequential tasks.
Faster Convergence: By normalizing activations it helps the model converge faster which allows for quicker training without the need for fine-tuning batch size or learning rate.

Applications of Layer Normalization

Layer Normalization is commonly used in various deep learning architectures:

Recurrent Neural Networks (RNNs): In RNNs, LSTMs and GRUs it is used to stabilize training as Batch Normalization struggles with sequential data. It normalizes activations at each time step to maintain stable gradient flow.
Transformers: Models like BERT and GPT apply Layer Normalization after attention layers to normalize activations which improves the stability and efficiency of training.
Generative Models: In Generative Adversarial Networks it stabilizes the training of both the generator and discriminator networks.
Speech Recognition: It is commonly applied in speech recognition systems to boast performance by normalizing the activations of the model’s recurrent layers.

Layer normalization is effective in scenarios where Batch Normalization would not be practical such as with small batch sizes or sequential models like RNNs. It helps to ensure a smoother and faster training process which leads to better performance across wide range of applications.

Comment

Article Tags: