Stochastic Gradient Descent is an optimization algorithm used in machine learning, especially for large datasets, that updates model parameters efficiently using small batches or single samples.
Variant of gradient descent designed for faster, more scalable learning
Updates parameters using one data point or small batches at a time
Reduces the computation per update compared to full-batch gradient descent
Widely used in deep learning for efficient training
Working of Stochastic Gradient Descent
Figure: Path followed by batch gradient descent vs. path followed by SGD
In traditional (batch) gradient descent, the gradient is computed over the entire dataset, which can be computationally expensive for large datasets.
In Stochastic Gradient Descent, the gradient is calculated for each training example (or a small subset of training examples) rather than the entire dataset.
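The Stochastic Gradient Descent update rule is:
$$\theta = \theta - \eta \, \nabla_\theta J(\theta; x_i, y_i)$$
Where:
$\theta$ is the parameter vector and $\eta$ is the learning rate.
$x_i$ and $y_i$ represent the features and target of the i-th training example.
The gradient $\nabla_\theta J(\theta; x_i, y_i)$ is now calculated for a single data point or a small batch.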
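A minimal sketch of one SGD update on a single training example. It assumes a linear model with a squared-error loss; the loss, the helper name `sgd_step`, and the toy numbers are illustrative assumptions, not taken from the text.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, eta):
    """One SGD update for a linear model with squared-error loss (assumed).

    theta: parameter vector, shape (d,)
    x_i:   feature vector of one training example, shape (d,)
    y_i:   scalar target of that example
    eta:   learning rate
    """
    error = x_i @ theta - y_i      # prediction error on this one example
    grad = 2.0 * error * x_i       # gradient of (x_i . theta - y_i)^2 w.r.t. theta
    return theta - eta * grad      # theta <- theta - eta * gradient

theta = np.zeros(2)
x_i = np.array([1.0, 2.0])         # the leading 1.0 acts as the bias feature
y_i = 10.0
theta = sgd_step(theta, x_i, y_i, eta=0.1)
print(theta)
```

Only one example is touched per update, so the cost of a step does not grow with the size of the dataset.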
Implementing Stochastic Gradient Descent from Scratch
1. Generating the Data
In this step, we generate synthetic data for the linear regression problem. The data consists of feature X and the target y where the relationship is linear, i.e., y = 4 + 3 * X + noise.
X is a random array of 100 samples between 0 and 2.
y is the target, calculated using a linear equation with a little random noise to make it more realistic.
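One way to generate such data with NumPy is sketched below. The seed and the choice of standard normal noise are assumptions made for reproducibility; the text only specifies 100 samples between 0 and 2 and "a little random noise".

```python
import numpy as np

rng = np.random.default_rng(42)          # fixed seed (an assumption) for reproducibility

X = 2 * rng.random((100, 1))             # 100 samples, uniform in [0, 2)
noise = rng.standard_normal((100, 1))    # Gaussian noise with unit variance (assumed)
y = 4 + 3 * X + noise                    # linear relationship plus noise

print(X.shape, y.shape)
```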
For a linear regression with one feature, the model is described by the equation:
$$y = \theta_0 + \theta_1 X$$
Where:
$\theta_0$ is the intercept (the bias term),
$\theta_1$ is the slope or coefficient associated with the input feature $X$.
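For this one-feature model, the per-sample gradient that SGD needs can be written out explicitly. The derivation below assumes a squared-error loss on a single example $(x_i, y_i)$, which the text does not state explicitly; $\theta_0$ and $\theta_1$ denote the intercept and slope.

```latex
\begin{aligned}
J_i(\theta) &= \left(\theta_0 + \theta_1 x_i - y_i\right)^2 \\
\frac{\partial J_i}{\partial \theta_0} &= 2\left(\theta_0 + \theta_1 x_i - y_i\right) \\
\frac{\partial J_i}{\partial \theta_1} &= 2\left(\theta_0 + \theta_1 x_i - y_i\right) x_i
\end{aligned}
```

Both partial derivatives share the same error term, so one prediction per example is enough to update both parameters.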
2. Defining the SGD Function
This step defines the Stochastic Gradient Descent function that initializes parameters, updates them iteratively, and tracks the loss during training.
Takes the input data X and target y
theta ($\theta$) is the parameter vector (intercept and slope), initialized randomly.
X_bias is X augmented with a column of ones for the bias term (intercept).
Shuffles data in each epoch and updates parameters using single samples or mini-batches
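The points above can be sketched as follows. This is one possible implementation, not necessarily the original one: the function name `sgd`, its default hyperparameters, the seeds, and the use of mean squared error as the tracked loss are all assumptions.

```python
import numpy as np

def sgd(X, y, learning_rate=0.01, epochs=50, batch_size=1, seed=0):
    """Fit y ~ theta_0 + theta_1 * X with (mini-batch) SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    X_bias = np.c_[np.ones((m, 1)), X]                  # column of ones for the intercept
    theta = rng.standard_normal((X_bias.shape[1], 1))   # random initialization
    loss_history = []

    for _ in range(epochs):
        order = rng.permutation(m)                      # reshuffle the data every epoch
        X_shuf, y_shuf = X_bias[order], y[order]
        for start in range(0, m, batch_size):
            X_b = X_shuf[start:start + batch_size]
            y_b = y_shuf[start:start + batch_size]
            # gradient of the mean squared error on this batch only
            grad = (2.0 / len(X_b)) * (X_b.T @ (X_b @ theta - y_b))
            theta -= learning_rate * grad
        # record the loss on the full dataset once per epoch
        loss_history.append(float(np.mean((X_bias @ theta - y) ** 2)))
    return theta, loss_history

# Usage on synthetic data following y = 4 + 3 * X + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
theta, losses = sgd(X, y, learning_rate=0.01, epochs=50, batch_size=1)
print("Final parameters:", theta.ravel())
```

With `batch_size=1` this is per-sample SGD; a larger `batch_size` gives mini-batch SGD with the same code path. The exact numbers printed depend on the seeds and will differ from run to run if they are changed.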
We will visualize the data points and the fitted regression line after training. We plot the data points as blue dots and the predicted line (from the final $\theta$) as a red line.
After training, we print the final parameters of the model which include the slope and intercept. These values are the result of optimizing the model using SGD.
Output:
Final parameters: [[4.35097872] [3.45754277]]
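The final parameters returned by the model are:
$$\theta_0 \approx 4.35, \quad \theta_1 \approx 3.46$$
Then the fitted linear regression model will be:
$$y \approx 4.35 + 3.46 X$$
This means:
When X = 0, y ≈ 4.35 (the intercept or bias term).
For each unit increase in X, y will increase by about 3.46 units (the slope or coefficient).
These are close to the values 4 and 3 used to generate the data; the remaining gap is expected given the added noise and the stochastic updates.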
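As an illustration, the fitted line can be evaluated at a few inputs using the parameters from the printed output. The inputs 0, 1 and 2 are arbitrary choices for this sketch.

```python
import numpy as np

# Parameters copied from the printed output (intercept, slope)
theta = np.array([[4.35097872], [3.45754277]])

X_new = np.array([[0.0], [1.0], [2.0]])                # illustrative inputs (assumed)
X_new_bias = np.c_[np.ones((len(X_new), 1)), X_new]    # add the column of ones
y_pred = X_new_bias @ theta                            # y = theta_0 + theta_1 * X

for x_val, y_val in zip(X_new.ravel(), y_pred.ravel()):
    print(f"X = {x_val:.1f} -> predicted y = {y_val:.2f}")
```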
Applications
Used in deep learning to train large neural networks efficiently
Applied in NLP for models like Word2Vec and transformers
Used in computer vision tasks such as image classification, object detection and segmentation
Applied in reinforcement learning for models like deep Q-networks and policy gradient methods
Advantages
Faster and more efficient since it updates parameters using one or a few data points
Requires less memory, making it suitable for large datasets
Noise in the stochastic updates can help the optimizer escape shallow local minima and saddle points
Supports online learning by updating the model with incoming data
Challenges
Updates can be noisy due to using single samples, causing fluctuations in the loss instead of smooth convergence
Highly sensitive to the learning rate: too high and training may diverge, too low and learning slows down
May take longer to converge overall despite faster individual updates
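The learning-rate sensitivity can be seen on a deliberately simple problem. The sketch below runs per-sample SGD on noise-free data y = 3x with three learning rates; the helper `run_sgd`, the data range, and the specific rates are assumptions chosen only to make the three regimes visible.

```python
import numpy as np

def run_sgd(eta, steps=100, seed=0):
    """Per-sample SGD on noise-free data y = 3 * x; returns the final weight."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(1.0, 2.0, size=steps)     # one sample per update step
    y = 3.0 * x
    w = 0.0
    for x_i, y_i in zip(x, y):
        grad = 2.0 * x_i * (w * x_i - y_i)    # gradient of (w * x_i - y_i)^2
        w -= eta * grad
    return w

results = {eta: run_sgd(eta) for eta in (0.001, 0.1, 1.5)}
for eta, w in results.items():
    print(f"eta = {eta}: final w = {w:.4g} (true value 3)")
```

In this setup each update multiplies the error `w - 3` by `1 - 2 * eta * x_i**2`. With `eta = 0.1` that factor stays between 0.2 and 0.8, so the error shrinks quickly; with `eta = 0.001` it is barely below 1, so 100 steps are not enough; with `eta = 1.5` its magnitude is at least 2, so the error grows at every step.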