Stochastic Gradient Descent (SGD)

Last Updated : 12 May, 2026

Stochastic Gradient Descent is an optimization algorithm used in machine learning, especially for large datasets, that updates model parameters efficiently using small batches or single samples.

Variant of gradient descent designed for faster and scalable learning
Updates parameters using one data point or small batches at a time
Reduces computation compared to full batch gradient descent
Widely used in deep learning for efficient training

Working of Stochastic Gradient Descent

Figure: Path followed by batch gradient descent vs. path followed by SGD

In traditional (batch) gradient descent, each gradient is computed over the entire dataset, which is computationally expensive for large datasets.
In Stochastic Gradient Descent, the gradient is calculated for each training example (or a small subset of training examples) rather than the entire dataset.
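The contrast between the two approaches can be illustrated with a minimal sketch. It assumes a linear model with a mean squared error loss (the setup used later in this article); the helper name mse_gradient and the toy data are illustrative choices, not part of the original text.

```python
import numpy as np

def mse_gradient(X_b, y, theta):
    """Gradient of the mean squared error for a linear model X_b @ theta."""
    m = len(X_b)
    return (2.0 / m) * X_b.T @ (X_b @ theta - y)

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))                    # toy feature values in [0, 2)
y = 4 + 3 * X + rng.standard_normal((100, 1))   # noisy linear target
X_b = np.c_[np.ones((100, 1)), X]               # prepend a column of ones for the bias
theta = np.zeros((2, 1))

# Batch gradient descent: one gradient computed from all 100 samples.
full_grad = mse_gradient(X_b, y, theta)

# SGD: one gradient per sample, each a cheap but noisy estimate.
per_sample = np.stack(
    [mse_gradient(X_b[i:i + 1], y[i:i + 1], theta) for i in range(100)]
)

# Averaged over the dataset, the per-sample gradients equal the full gradient.
mean_grad = per_sample.mean(axis=0)
print(np.allclose(mean_grad, full_grad))
```

Each per-sample gradient costs a fraction of the full computation, which is why SGD scales to large datasets. The spread among the per-sample gradients is the source of the noisy path shown in the figure.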

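Stochastic Gradient Descent update rule is:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x_i, y_i)$$

Where:

$\theta$ is the parameter vector and $\eta$ is the learning rate.
$x_i$ and $y_i$ represent the features and target of the i-th training example.
The gradient is now calculated for a single data point or a small batch.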
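As a worked example of the update rule, the sketch below applies one step to a single sample. It assumes a linear model with squared-error loss; the function name sgd_step and the numbers are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, lr):
    """One SGD update for a linear model with squared-error loss on one sample."""
    error = x_i @ theta - y_i          # prediction error, shape (1, 1)
    grad = 2.0 * x_i.T @ error         # gradient of (x_i @ theta - y_i)^2 w.r.t. theta
    return theta - lr * grad

theta = np.zeros((2, 1))
x_i = np.array([[1.0, 2.0]])           # [bias term, feature value]
y_i = np.array([[10.0]])
theta = sgd_step(theta, x_i, y_i, lr=0.1)
print(theta.ravel())                   # [2. 4.]
```

With theta starting at zero, the error is -10, the gradient is (-20, -40), and a learning rate of 0.1 moves theta to (2, 4).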

Implementing Stochastic Gradient Descent from Scratch

1. Generating the Data

In this step, we generate synthetic data for the linear regression problem. The data consists of feature X and the target y where the relationship is linear, i.e., y = 4 + 3 * X + noise.

X is a random array of 100 samples between 0 and 2.
y is the target, calculated using a linear equation with a little random noise to make it more realistic.
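The data described above could be generated roughly as follows. This is a sketch: the seed value and the use of standard normal noise are assumptions, since the text only specifies 100 samples between 0 and 2 and "a little random noise".

```python
import numpy as np

np.random.seed(42)                          # assumed seed, for reproducibility only
X = 2 * np.random.rand(100, 1)              # 100 samples, uniform in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)     # linear target plus standard normal noise
print(X.shape, y.shape)                     # (100, 1) (100, 1)
```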

For a linear regression with one feature, the model is described by the equation:

$$y = \theta_0 + \theta_1 X$$

Where:

$\theta_0$ is the intercept (the bias term),
$\theta_1$ is the slope or coefficient associated with the input feature $X$.

2. Defining the SGD Function

This step defines the Stochastic Gradient Descent function that initializes parameters, updates them iteratively, and tracks the loss during training.

Takes input data X and target y
theta ($\theta$) is the parameter vector (intercept and slope), initialized randomly.
X_bias is X augmented with a column of ones for the bias term (intercept).
Shuffles data in each epoch and updates parameters using single samples or mini-batches
Computes loss using Mean Squared Error (MSE)
Stores loss values to monitor convergence

3. Training the Model Using SGD

In this step, we call the sgd() function to train the model. We specify the learning rate, number of epochs and batch size for SGD.
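Steps 2 and 3 can be sketched together as below. This is one possible implementation consistent with the description, not necessarily the original code: the default argument values, the seed parameter, and the choice to record the loss on the full dataset once per epoch are all assumptions.

```python
import numpy as np

def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1, seed=0):
    """Mini-batch SGD for linear regression; returns (theta, cost_history)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    theta = rng.standard_normal((2, 1))          # random initial intercept and slope
    X_bias = np.c_[np.ones((m, 1)), X]           # column of ones for the bias term
    cost_history = []

    for _ in range(epochs):
        order = rng.permutation(m)               # reshuffle the data every epoch
        X_shuf, y_shuf = X_bias[order], y[order]
        for start in range(0, m, batch_size):
            X_batch = X_shuf[start:start + batch_size]
            y_batch = y_shuf[start:start + batch_size]
            # Gradient of the MSE computed on this batch only
            grad = (2.0 / len(X_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta = theta - learning_rate * grad
        # Record the MSE on the full dataset once per epoch
        cost_history.append(float(np.mean((X_bias @ theta - y) ** 2)))
    return theta, cost_history

# Synthetic data following y = 4 + 3 * X + noise (assumed standard normal noise)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

theta, cost_history = sgd(X, y, learning_rate=0.01, epochs=100, batch_size=1)
print("Final parameters:", theta.ravel())
print("Final MSE:", cost_history[-1])
```

Setting batch_size=1 gives pure stochastic updates, while a larger value gives mini-batch SGD. Because the data is noisy, the final MSE is expected to settle near the noise variance rather than at zero.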

Output:

Figure: Train the Model Using SGD (training output)

4. Visualizing the Cost Function

After training, we visualize how the cost function evolves over epochs. This helps us understand if the algorithm is converging properly.

Output:

Figure: Visualize the Cost Function

5. Plotting the Data and Regression Line

We will visualize the data points and the fitted regression line after training. We plot the data points as blue dots and the predicted line (from the final $\theta$) as a red line.

Output:

Figure: Plot the Data and Regression Line

6. Printing the Final Model Parameters

After training, we print the final parameters of the model which include the slope and intercept. These values are the result of optimizing the model using SGD.

Output:

Final parameters: [[4.35097872] [3.45754277]]

The final parameters returned by the model are:

$$\theta_0 \approx 4.35, \quad \theta_1 \approx 3.46$$

Then the fitted linear regression model will be:

$$y = 4.35 + 3.46 X$$

This means:

When X = 0, y ≈ 4.35 (the intercept or bias term).
For each unit increase in X, y will increase by about 3.46 units (the slope or coefficient).
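The interpretation of the intercept and slope can be checked with a short sketch that evaluates the fitted line. The parameter values are copied from the printed output above; the helper name predict is hypothetical.

```python
import numpy as np

# Parameters copied from the printed output: [intercept, slope]
theta = np.array([[4.35097872], [3.45754277]])

def predict(x, theta):
    """Evaluate the fitted line y = theta_0 + theta_1 * x."""
    x = np.atleast_1d(np.asarray(x, dtype=float)).reshape(-1, 1)
    X_bias = np.c_[np.ones((len(x), 1)), x]   # same bias column used in training
    return (X_bias @ theta).ravel()

preds = predict([0.0, 1.0, 2.0], theta)
print(preds)                 # the value at X = 0 is the intercept
print(preds[1] - preds[0])   # the increase per unit of X is the slope
```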

Applications

Used in deep learning to train large neural networks efficiently
Applied in NLP for models like Word2Vec and transformers
Used in computer vision tasks such as image classification, object detection and segmentation
Applied in reinforcement learning for models like deep Q-networks and policy gradient methods

Advantages

Faster and more efficient since it updates parameters using one or a few data points
Requires less memory, making it suitable for large datasets
The noise in stochastic updates can help the optimizer escape shallow local minima and saddle points
Supports online learning by updating the model with incoming data

Challenges

Updates can be noisy due to using single samples, causing fluctuations in the loss instead of smooth convergence
Highly sensitive to the learning rate: a value that is too high may cause divergence, while one that is too low slows down learning
May take longer to converge overall despite faster individual updates
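The learning-rate sensitivity listed above can be seen in a small experiment. The sketch below runs single-sample SGD on toy linear data with three learning rates. The function name final_mse, the specific rates, the epoch count and the divergence threshold are all illustrative assumptions.

```python
import numpy as np

def final_mse(lr, epochs=20, seed=0):
    """Run single-sample SGD on toy linear data and return the final MSE."""
    rng = np.random.default_rng(seed)
    X = 2 * rng.random((100, 1))
    y = 4 + 3 * X + rng.standard_normal((100, 1))
    X_b = np.c_[np.ones((100, 1)), X]
    theta = np.zeros((2, 1))
    for _ in range(epochs):
        for i in rng.permutation(100):
            x_i, y_i = X_b[i:i + 1], y[i:i + 1]
            theta = theta - lr * 2.0 * x_i.T @ (x_i @ theta - y_i)
            if np.abs(theta).max() > 1e6:       # treat as diverged and stop early
                return float("inf")
    return float(np.mean((X_b @ theta - y) ** 2))

for lr in (1.0, 0.05, 0.0001):
    print(f"lr={lr}: final MSE = {final_mse(lr)}")
```

In this setup the largest rate is reported as diverged, the middle rate reaches a low loss, and the smallest rate is still far from converged after the same number of epochs.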


URL: https://www.geeksforgeeks.org/machine-learning/ml-stochastic-gradient-descent-sgd/