Backpropagation Through Time in RNN

Last Updated : 18 May, 2026

Recurrent Neural Networks (RNNs) are designed for sequential data such as text, speech and time series. Unlike traditional neural networks, RNNs use an internal memory (hidden state) so the output depends on both current and previous inputs.

Handles sequential and time-dependent data
Uses hidden states to store information from previous time steps
Captures temporal dependencies across sequences
Uses Backpropagation Through Time (BPTT) for learning
Learns complex sequential patterns from data

RNN Architecture

At each timestep $t$, the RNN maintains a hidden state $S_t$ that stores information from previous inputs. The hidden state updates by combining the current input $X_t$ and the previous hidden state $S_{t-1}$, then applying an activation function to introduce non-linearity. The output $Y_t$ is then generated by transforming this hidden state.

$$S_t = g_1(W_x X_t + W_s S_{t-1})$$

$S_t$ represents the hidden state (memory) at time $t$.
$X_t$ is the input at time $t$.
$Y_t$ is the output at time $t$.
$W_s, W_x, W_y$ are weight matrices for hidden states, inputs and outputs, respectively.

$$Y_t = g_2(W_y S_t)$$

where $g_1$ and $g_2$ are activation functions.

Figure: RNN Architecture
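The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes scalar weights instead of matrices, tanh for the hidden activation, an identity output activation and an initial hidden state of 0. The function name `rnn_forward` and the sample values are hypothetical.

```python
import math

def rnn_forward(xs, w_x, w_s, w_y, s0=0.0):
    """Run a scalar RNN over the sequence xs; return hidden states and outputs."""
    states, outputs = [], []
    s_prev = s0
    for x_t in xs:
        # Hidden state: combine current input with previous hidden state (tanh assumed)
        s_t = math.tanh(w_x * x_t + w_s * s_prev)
        # Output: transform the hidden state (identity activation assumed)
        y_t = w_y * s_t
        states.append(s_t)
        outputs.append(y_t)
        s_prev = s_t
    return states, outputs

# Hypothetical three-step sequence and weights
states, outputs = rnn_forward([1.0, 0.5, -0.5], w_x=0.5, w_s=0.8, w_y=1.2)
```

Because the same three weights are reused at every step, each later hidden state carries a trace of every earlier input.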

Error Function at Time $t = 3$

To train the network, we measure how far the predicted output is from the desired output using an error function. We use the squared error between the desired output $d_t$ and the actual output $Y_t$:

$$E_t = (d_t - Y_t)^2$$

At $t = 3$:

$$E_3 = (d_3 - Y_3)^2$$

This error quantifies the difference between the predicted output and the desired output at time 3.
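As a small worked example, the squared error and its derivative with respect to the output can be written as two helper functions. The function names and the sample values for the desired and actual outputs are hypothetical, chosen only for illustration.

```python
def squared_error(d_t, y_t):
    """Squared error between desired output d_t and actual output y_t."""
    return (d_t - y_t) ** 2

def squared_error_grad(d_t, y_t):
    """Derivative of the squared error with respect to the actual output y_t."""
    return -2.0 * (d_t - y_t)

# Hypothetical desired and actual outputs at time step 3
d_3, y_3 = 1.0, 0.6
e_3 = squared_error(d_3, y_3)            # approximately 0.16
dE3_dY3 = squared_error_grad(d_3, y_3)   # approximately -0.8
```

The derivative is the starting point of every gradient computed during backpropagation: it is negative here because raising the output would reduce the error.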

Updating Weights Using BPTT

Backpropagation Through Time (BPTT) updates the weights by computing gradients across multiple time steps to minimize error.

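1. Adjusting Output Weight $W_y$

The output weight $W_y$ directly affects the current output $Y_3$, so its update depends only on the current time step.

Using the chain rule:

$$\frac{\partial E_3}{\partial W_y} = \frac{\partial E_3}{\partial Y_3} \cdot \frac{\partial Y_3}{\partial W_y}$$

$E_3$ depends on $Y_3$, so we differentiate $E_3$ w.r.t. $Y_3$.
$Y_3$ depends on $W_y$, so we differentiate $Y_3$ w.r.t. $W_y$.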

Figure: Adjusting Wy
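The two-factor chain rule for the output weight can be checked numerically. The sketch below assumes a scalar RNN with tanh hidden activation, identity output activation and squared error at the third step. All names and values are hypothetical. The analytic gradient is compared against a central finite difference.

```python
import math

def forward(xs, w_x, w_s, w_y):
    """Return the final hidden state and output of a scalar RNN (tanh assumed)."""
    s = 0.0
    for x in xs:
        s = math.tanh(w_x * x + w_s * s)
    return s, w_y * s

def error(xs, d, w_x, w_s, w_y):
    """Squared error at the last time step."""
    _, y = forward(xs, w_x, w_s, w_y)
    return (d - y) ** 2

xs, d_3 = [1.0, 0.5, -0.5], 0.3          # hypothetical inputs and target
w_x, w_s, w_y = 0.5, 0.8, 1.2            # hypothetical weights

s_3, y_3 = forward(xs, w_x, w_s, w_y)
dE_dY = -2.0 * (d_3 - y_3)               # derivative of E_3 w.r.t. Y_3
dY_dWy = s_3                             # derivative of Y_3 = w_y * S_3 w.r.t. w_y
grad_wy = dE_dY * dY_dWy

# Finite-difference check of the analytic gradient
eps = 1e-6
numeric = (error(xs, d_3, w_x, w_s, w_y + eps)
           - error(xs, d_3, w_x, w_s, w_y - eps)) / (2 * eps)
```

No loop over earlier time steps is needed here, since the output weight enters only the final transformation from hidden state to output.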

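2. Adjusting Hidden State Weight $W_s$

The hidden state weight $W_s$ affects both the current and previous hidden states because each hidden state depends on the one before it. Therefore, updating $W_s$ requires considering how all hidden states $S_1, S_2, S_3$ influence the output at time step 3.

Gradient Flow Through Hidden States

Start with the error gradient at output $Y_3$.
Propagate gradients back through all hidden states $S_3, S_2, S_1$ since they affect $Y_3$.
Each $S_t$ depends on $W_s$, so we differentiate accordingly.

$$\frac{\partial E_3}{\partial W_s} = \sum_{i=1}^{3} \frac{\partial E_3}{\partial Y_3} \cdot \frac{\partial Y_3}{\partial S_i} \cdot \frac{\partial S_i}{\partial W_s}$$

Here $\frac{\partial Y_3}{\partial S_i}$ includes the chain through the intermediate hidden states between step $i$ and step 3, and $\frac{\partial S_i}{\partial W_s}$ is the direct dependence of $S_i$ on $W_s$ with $S_{i-1}$ held fixed.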

Figure: Adjusting Ws
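The backward sweep through the hidden states can be sketched as a loop that runs from the last time step to the first, accumulating one term per step. This is an illustrative sketch under stated assumptions: scalar weights, tanh hidden activation, identity output activation, initial hidden state 0 and squared error at the final step. The function names and sample values are hypothetical.

```python
import math

def forward(xs, w_x, w_s, w_y):
    """Return all hidden states (index 0 is the initial state) and the final output."""
    states = [0.0]
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    return states, w_y * states[-1]

def bptt_grad_ws(xs, d, w_x, w_s, w_y):
    """Gradient of the final squared error w.r.t. the hidden weight w_s."""
    states, y = forward(xs, w_x, w_s, w_y)
    dE_dS = -2.0 * (d - y) * w_y                  # gradient reaching the last hidden state
    grad = 0.0
    for t in range(len(xs), 0, -1):
        dE_da = dE_dS * (1.0 - states[t] ** 2)    # back through tanh at step t
        grad += dE_da * states[t - 1]             # direct dependence of S_t on w_s
        dE_dS = dE_da * w_s                       # pass the gradient on to S_{t-1}
    return grad

xs, d_3 = [1.0, 0.5, -0.5], 0.3                   # hypothetical inputs and target
w_x, w_s, w_y = 0.5, 0.8, 1.2                     # hypothetical weights
grad_ws = bptt_grad_ws(xs, d_3, w_x, w_s, w_y)

# Finite-difference check of the accumulated gradient
def error(w_s_value):
    _, y = forward(xs, w_x, w_s_value, w_y)
    return (d_3 - y) ** 2

eps = 1e-6
numeric = (error(w_s + eps) - error(w_s - eps)) / (2 * eps)
```

Each pass through the loop multiplies the running gradient by `w_s` and by the tanh derivative, which is why the contribution of early time steps depends on a product of such factors.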

3. Adjusting Input Weight $W_x$

Similar to $W_s$, the input weight $W_x$ affects all hidden states because the input at each timestep shapes the hidden state. The gradient therefore accounts for how every input in the sequence impacts the hidden states $S_1, S_2, S_3$ leading to the output at time 3:

$$\frac{\partial E_3}{\partial W_x} = \sum_{i=1}^{3} \frac{\partial E_3}{\partial Y_3} \cdot \frac{\partial Y_3}{\partial S_i} \cdot \frac{\partial S_i}{\partial W_x}$$

Figure: Adjusting Wx
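The input weight gradient follows the same backward loop, except that each step contributes the input at that step rather than the previous hidden state. The sketch below also applies a single gradient descent update to show how the gradient is used. It assumes a scalar RNN with tanh hidden activation, identity output activation and squared error at the final step. The learning rate, function names and sample values are hypothetical.

```python
import math

def forward(xs, w_x, w_s, w_y):
    """Return all hidden states (index 0 is the initial state) and the final output."""
    states = [0.0]
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    return states, w_y * states[-1]

def error(xs, d, w_x, w_s, w_y):
    """Squared error at the last time step."""
    _, y = forward(xs, w_x, w_s, w_y)
    return (d - y) ** 2

def bptt_grad_wx(xs, d, w_x, w_s, w_y):
    """Gradient of the final squared error w.r.t. the input weight w_x."""
    states, y = forward(xs, w_x, w_s, w_y)
    dE_dS = -2.0 * (d - y) * w_y
    grad = 0.0
    for t in range(len(xs), 0, -1):
        dE_da = dE_dS * (1.0 - states[t] ** 2)    # back through tanh at step t
        grad += dE_da * xs[t - 1]                 # direct dependence of S_t on w_x
        dE_dS = dE_da * w_s                       # pass the gradient on to S_{t-1}
    return grad

xs, d_3 = [1.0, 0.5, -0.5], 0.3                   # hypothetical inputs and target
w_x, w_s, w_y = 0.5, 0.8, 1.2                     # hypothetical weights
grad_wx = bptt_grad_wx(xs, d_3, w_x, w_s, w_y)

# Finite-difference check of the accumulated gradient
eps = 1e-6
numeric = (error(xs, d_3, w_x + eps, w_s, w_y)
           - error(xs, d_3, w_x - eps, w_s, w_y)) / (2 * eps)

# One gradient descent step with an assumed learning rate
learning_rate = 0.1
old_error = error(xs, d_3, w_x, w_s, w_y)
new_w_x = w_x - learning_rate * grad_wx
new_error = error(xs, d_3, new_w_x, w_s, w_y)
```

In practice all three weights would be updated together from the same backward pass; they are separated here only to mirror the three cases in the text.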

Advantages

Captures temporal dependencies across time steps
Learns how past inputs influence future outputs
Forms the foundation for training LSTMs and GRUs
Supports learning from variable-length sequences

Limitations

Gradients may become very small (vanishing gradients), making long-term dependencies difficult to learn
Gradients may grow excessively large (exploding gradients), causing unstable training and updates
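Both limitations come from the repeated multiplication that happens as gradients travel back through time. As a rough illustration, the sketch below assumes a scalar linear recurrence, so the gradient passed back over a number of steps is simply the hidden weight raised to that power. With a tanh activation each factor is additionally multiplied by a derivative that is at most 1. The function name and the weight values are hypothetical.

```python
def gradient_factor(w_s, steps):
    """Magnitude of the gradient passed back over `steps` steps of S_t = w_s * S_{t-1}."""
    factor = 1.0
    for _ in range(steps):
        factor *= w_s
    return abs(factor)

vanishing = gradient_factor(0.5, 50)   # weight below 1: the factor shrinks toward zero
exploding = gradient_factor(1.5, 50)   # weight above 1: the factor grows very large
```

With 50 steps the first factor is far too small to carry a useful learning signal, while the second is large enough to destabilize an update.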


URL: https://www.geeksforgeeks.org/machine-learning/ml-back-propagation-through-time/