Gradient Descent Algorithm in Machine Learning

Last Updated : 12 May, 2026

Gradient Descent is an optimization algorithm used to minimize the error of a machine learning model by updating parameters in the direction of decreasing loss.

👁 gradient_intuition

Try to reach global minima where error is minimum₹

Minimizes the loss or error of a model
Updates parameters step by step using gradients
Helps models like neural networks and regression learn optimal weights
Variants include Batch Gradient Descent, Stochastic Gradient Descent and Mini Batch Gradient Descent

Formula:

1. Linear Regression

Linear Regression is a supervised learning algorithm used to predict continuous numerical values by modeling the relationship between input variables and the output using a best-fit line.

Predicts continuous values (e.g., price, temperature)
Models relationship using a straight line
Minimizes prediction error using Mean Squared Error (MSE)
Simple and widely used regression technique

Where:

: input feature
: weight (slope of the line)
: bias/intercept
: predicted output
: true target value
: number of samples
: MSE loss

Role of Gradient Descent

Gradient Descent helps the Linear Regression model find the best values of weight and bias so that the prediction error becomes as small as possible. It starts with random values and gradually adjusts them in the direction that reduces the loss.

👁 mean_squared_error_mse_

Gradient Descent

Calculates how the loss changes with respect to weight and bias
Updates these parameters step by step to reduce prediction error
Repeats the process until the loss becomes very small and the best line is found

Implementation

Creates a dataset: Generates 200 sample data points with some random noise.
Initializes parameters: Starts with initial values for weight and bias, and sets the learning rate and number of iterations.
Trains the model: In each iteration, the model makes predictions, calculates the error and updates the parameters using Gradient Descent.
Tracks the loss: Stores the loss value in every iteration to observe how the error changes.
Plots the loss graph: Shows how the error decreases over iterations.
Plots the fitted line: Displays the data points along with the final regression line learned by the model.

Output:

2. Logistic Regression

Logistic Regression is a supervised learning algorithm used for binary classification, predicting the probability that a data point belongs to a specific class.

Used for binary classification tasks
Outputs probabilities between 0 and 1 using the sigmoid function
Converts probabilities into class labels (e.g., 0 or 1)
Trained by minimizing binary cross-entropy loss
Simple and effective for classification problems

Where:

: linear combination
: sigmoid output
: predicted probability
: actual label
: number of samples
: cross-entropy loss

Role of Gradient Descent

Gradient descent helps logistic regression find optimal parameter values by reducing prediction error over time.

Computes how cross-entropy loss changes with respect to model parameters
Updates parameters step by step to improve probability predictions
Iteratively reduces loss until the model converges to better performance

Implementation

Creates a dataset: Generates sample data with two classes.
Trains the model using Gradient Descent: The model predicts probabilities, calculates the error using cross entropy loss and updates the weights and bias to reduce the loss.
Shows learning progress: Plots a graph of loss vs iterations to show how the error decreases during training.
Shows the final classification boundary: Displays the data points and the decision boundary learned by the model.

Output:

3. Softmax Regression

Softmax Regression is an extension of logistic regression used for multi-class classification, predicting probabilities across multiple classes.

Used for multi-class classification problems
Outputs a probability distribution over all classes
Uses the softmax function to convert scores into probabilities
Selects the class with the highest probability as the prediction
Trained by minimizing categorical cross-entropy loss

Where:

: score(logit) for class k for sample i
: probability that sample i belongs to class k
: number of classes
: one-hot encoded true label
: number of samples
: multiclass cross-entropy loss

Role of Gradient Descent

Gradient descent helps the softmax regression model learn optimal parameters by reducing prediction error and improving class probabilities.

Computes how multiclass cross-entropy loss changes with respect to the weights
Updates weights step by step to improve probability of the correct class
Iteratively reduces loss until the model achieves better performance

Implementation

Creates a dataset: Generates sample data with three different classes.
Trains the model using Gradient Descent: The model calculates class probabilities using the softmax function, computes the error using cross entropy loss and updates the weights and bias to reduce the loss.
Shows learning progress: Plots a graph of loss vs iterations to show how the error decreases during training.
Shows the final class regions: Displays the data points and the regions learned by the model for each class.

Output:

4.Neural Network

A neural network with one hidden layer can learn complex patterns that linear models cannot capture by introducing non-linearity through activation functions.

Uses a hidden layer to model complex relationships
Applies activation functions like ReLU for non-linearity
Uses softmax in the output layer to predict class probabilities
Trained using the backpropagation algorithm

Hidden Layer:

Output Layer:

Loss:

Where:

: weight matrices
: bias vectors
: hidden layer activations
: input vector
: softmax output probabilities
: true class label (one-hot)
: number of classes
: cross-entropy loss

Role of Gradient Descent

Gradient descent helps a neural network learn optimal weights by reducing prediction error, working together with backpropagation to improve performance.

Uses gradients from backpropagation to update weights in each layer
Adjusts weights step by step to reduce prediction error
Iteratively improves learning until the model captures better patterns

Implementation

Creates the dataset and neural network: Generates data with three classes and builds a neural network with one hidden layer.
Trains the model: The network makes predictions, calculates the loss and updates the weights using backpropagation and Gradient Descent.
Shows the results: Plots the training loss over epochs and displays the decision regions learned by the neural network.

Output:

5. Support Vector Machine

A Support Vector Machine is a supervised learning algorithm used for classification and regression that finds the best boundary to separate data points.

Creates a decision boundary (hyperplane) to separate classes
Maximizes the margin between different classes
Uses support vectors (critical data points) to define the boundary
Applies hinge loss to penalize incorrect or close predictions

Where:

: weight vector
: bias term
: input
: class label
: regularization strength
: hinge loss
: total SVM objective

Role of Gradient Descent

Gradient Descent helps the SVM model find the best parameters so that the classification boundary separates the classes as clearly as possible. It adjusts the parameters by reducing hinge loss and improving the margin between classes.

Updates the model parameters to reduce hinge loss
Adjusts the boundary when points are misclassified or too close to it
Repeats the process to improve the margin and classification accuracy

Implementation

Creates a dataset: Generates sample data with two classes and converts labels into values suitable for SVM training.
Trains the SVM using Gradient Descent: Calculates hinge loss, updates the model parameters step by step and improves the decision boundary during training.
Shows the results: Plots the objective value over iterations and displays the final decision boundary separating the two classes.

Output:

6. Matrix Factorization

Matrix Factorization is used in recommender systems to find hidden patterns in user item ratings. It breaks the rating matrix into smaller latent factors representing user preferences and item features, which are used to predict missing ratings.

Reconstruction:

Loss:

Where:

: rating matrix
: user latent matrix
: item latent matrix
: observed user-item pairs
: actual rating
: predicted rating
: regularization strength
: reconstruction loss

Role of Gradient Descent

Gradient Descent helps the Matrix Factorization model learn the best user and item factors so that the predicted ratings become closer to the actual ratings. It:

Updates user and item latent vectors using observed ratings
Reduces the difference between predicted and actual ratings
Repeats the process to learn better user and item representations

Implementation

Creates a rating dataset: Generates a user item rating matrix and keeps only some ratings as observed data.
Trains the model using Gradient Descent: Updates user and item latent factors using the observed ratings to reduce prediction error.
Shows learning progress: Plots the loss over epochs to show how the reconstruction error decreases.
Shows the reconstructed matrix: Displays the observed ratings and the reconstructed full matrix learned by the model.

Output:

Download code for all implementations from here

Comment

Article Tags:

Machine Learning

Deep Learning

Explore

Machine Learning Basics

Python for Machine Learning

Feature Engineering

Supervised Learning

Unsupervised Learning

Model Evaluation and Tuning

Advanced Techniques

Machine Learning Practice

Courses

URL: https://www.geeksforgeeks.org/machine-learning/gradient-descent-algorithm-and-its-variants/