Gradient descent is an optimization algorithm used in machine learning to minimize a function by iteratively stepping in the direction that reduces it. It matters because adjusting model parameters to lower the loss is how prediction error is reduced. This article explores several variants of the gradient descent algorithm.
1. Batch Gradient Descent
Batch Gradient Descent is a variant of the gradient descent algorithm in which the entire dataset is used to compute the gradient of the loss function with respect to the parameters. In each iteration, the algorithm calculates the average gradient of the loss function over all training examples and updates the model parameters accordingly.
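The update rule is $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$, where $\theta$ denotes the model parameters, $\eta$ is the learning rate, and $\nabla_\theta J(\theta)$ is the gradient of the loss function $J$ with respect to $\theta$.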
Python Implementation
Computes the gradient using all training examples.
Averages the gradient over the full dataset.
Updates theta once per epoch.
Suitable for small to medium datasets.
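The points above can be sketched as follows. This is a minimal illustration, assuming a linear regression model with a half mean squared error loss. The function name `batch_gradient_descent`, the synthetic data (y = 4 + 3x plus noise) and the hyperparameter values are assumptions chosen for the example, not taken from the article.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.3, epochs=500):
    """Fit linear regression weights with full-batch gradient descent.

    Assumed loss: J(theta) = (1 / (2n)) * sum((X @ theta - y) ** 2).
    """
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        residuals = X @ theta - y            # prediction errors on all examples
        grad = X.T @ residuals / n_samples   # gradient averaged over the full dataset
        theta -= lr * grad                   # exactly one update per epoch
    return theta

# Illustrative synthetic data: y = 4 + 3x plus small Gaussian noise.
# The first column of X is a constant 1 so theta[0] acts as the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x + rng.normal(0, 0.1, size=100)

theta = batch_gradient_descent(X, y)
print(theta)  # expected to land near [4, 3]
```

Because every update touches all rows of `X`, the cost of one step grows linearly with the dataset size, which is the trade-off discussed below.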
Advantages
Stable Convergence: Since the gradient is averaged over all training examples, the updates are less noisy and more stable.
Global View: It considers the entire dataset for each update, so every step reflects the loss over all of the training data.
Disadvantages
Computationally Expensive: Processing the entire dataset in each iteration can be slow and resource-intensive, especially for large datasets.
Memory Intensive: It requires holding and processing the entire dataset in memory, which can be impractical for very large datasets.
2. Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm in which the model parameters are updated using the gradient of the loss function with respect to a single training example at each iteration. Unlike batch gradient descent, which uses the entire dataset for every update, SGD updates the parameters after each example. These much more frequent updates often let it make progress faster per pass over the data, at the cost of noisier steps.
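The update rule is $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$, where $\eta$ is the learning rate and $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is the gradient of the loss function with respect to $\theta$ for the training example $(x^{(i)}, y^{(i)})$.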
Python Implementation
Updates theta using one example at a time.
Leads to faster but noisier updates.
Useful for online learning and large datasets.
More sensitive to learning rate.
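The points above can be sketched as follows. This is a minimal illustration, assuming a linear regression model with a squared error loss per example. The function name `stochastic_gradient_descent`, the synthetic data and the hyperparameter values are assumptions made for the example.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.05, epochs=50, seed=0):
    """Fit linear regression weights, updating after every single example."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):   # visit examples in a fresh random order
            error = X[i] @ theta - y[i]        # scalar prediction error on one example
            theta -= lr * error * X[i]         # step along that example's gradient
    return theta

# Illustrative synthetic data: y = 4 + 3x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x + rng.normal(0, 0.1, size=100)

theta = stochastic_gradient_descent(X, y)
print(theta)  # expected to hover near [4, 3], with some residual noise
```

With a constant learning rate the parameters keep jittering around the minimum instead of settling exactly on it, which is why SGD is often paired with a decaying learning rate in practice.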
Advantages
Faster Convergence: Frequent updates can lead to faster convergence, especially in large datasets.
Less Memory Intensive: Since it processes one training example at a time, it requires less memory compared to batch gradient descent.
Better for Online Learning: Suitable for scenarios where data comes in a stream, allowing the model to be updated continuously.
Disadvantages
Noisy Updates: Updates can be noisy, leading to a more erratic convergence path.
Potential for Overshooting: The frequent updates can cause the algorithm to overshoot the minimum, especially with a high learning rate.
Hyperparameter Sensitivity: Requires careful tuning of the learning rate to ensure stable and efficient convergence.
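The overshooting risk can be seen on a deliberately simple case. The sketch below runs plain gradient descent on the one-dimensional function f(theta) = theta**2, which is an assumption chosen for illustration and not an SGD training loop. The helper name `descend` and both learning rates are likewise illustrative.

```python
def descend(lr, steps=20, theta=1.0):
    """Run gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

# lr = 0.1: each step multiplies theta by 0.8, so it shrinks toward the minimum at 0.
small = descend(lr=0.1)
# lr = 1.1: each step multiplies theta by -1.2, so it flips sign and grows every step.
large = descend(lr=1.1)

print(abs(small), abs(large))
```

The same mechanism applies per update in SGD: if the step size is too large relative to the curvature of the loss, each update lands farther from the minimum than where it started.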
3. Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset or a single training example, it updates the model parameters using a small, random subset of the training data called a mini-batch.
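The update rule is $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; \mathcal{B})$, where $\eta$ is the learning rate, $\mathcal{B}$ is the current mini-batch, and $\nabla_\theta J(\theta; \mathcal{B})$ is the gradient of the loss function with respect to $\theta$ averaged over the mini-batch.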
Python Implementation
Splits the data into mini-batches (for example, 32 samples each).
Shuffles the data so that each mini-batch is a random subset, which tends to help generalization.
Combines the speed of SGD with the stability of Batch GD.
Supports parallel computation on hardware such as GPUs.
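The points above can be sketched as follows. This is a minimal illustration, assuming a linear regression model with a half mean squared error loss. The function name `minibatch_gradient_descent`, the synthetic data, the batch size of 32 and the other hyperparameters are assumptions made for the example.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Fit linear regression weights using shuffled mini-batches."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # reshuffle at the start of every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # the final batch may be smaller
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient averaged over the mini-batch
            theta -= lr * grad
    return theta

# Illustrative synthetic data: y = 4 + 3x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x + rng.normal(0, 0.1, size=100)

theta = minibatch_gradient_descent(X, y)
print(theta)  # expected to land near [4, 3]
```

Each mini-batch gradient is a single matrix product over `batch_size` rows, which is the part that vectorized libraries and GPUs can parallelize.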
Advantages
Faster Convergence: By using mini-batches, it achieves a balance between the noisy updates of SGD and the stable updates of Batch Gradient Descent, often leading to faster convergence.
Reduced Memory Usage: Requires less memory than Batch Gradient Descent as it only needs to store a mini-batch at a time.
Efficient Computation: Allows for efficient use of hardware optimizations and parallel processing, making it suitable for large datasets.
Disadvantages
Complexity in Tuning: Requires careful tuning of the mini-batch size and learning rate to ensure optimal performance.
Less Stable than Batch GD: While more stable than SGD, it can still be less stable than Batch Gradient Descent, especially if the mini-batch size is too small.
Potential for Suboptimal Mini-Batch Sizes: Selecting an inappropriate mini-batch size can lead to suboptimal performance and convergence issues.
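The effect of mini-batch size on stability can be checked numerically. The sketch below is an illustrative experiment, not a tuning recipe. It assumes a linear regression loss on synthetic data and measures how far mini-batch gradients stray, on average, from the full-dataset gradient at a fixed parameter value. The helper name `gradient_deviation` and the batch sizes tried are assumptions.

```python
import numpy as np

# Illustrative synthetic data: y = 4 + 3x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=1000)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x + rng.normal(0, 0.5, size=1000)
theta = np.zeros(2)  # gradients are compared at this fixed parameter value

def gradient_deviation(batch_size, trials=500):
    """Mean distance between a random mini-batch gradient and the full gradient."""
    full = X.T @ (X @ theta - y) / len(y)
    distances = []
    for _ in range(trials):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g = X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size
        distances.append(np.linalg.norm(g - full))
    return float(np.mean(distances))

deviation = {b: gradient_deviation(b) for b in (1, 8, 64)}
print(deviation)  # the deviation shrinks as the batch size grows
```

Smaller batches give cheaper but noisier gradient estimates, so the batch size and the learning rate usually need to be tuned together.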
4. Momentum-Based Gradient Descent
Momentum-Based Gradient Descent is an enhancement of the standard gradient descent algorithm that aims to accelerate convergence, particularly in the presence of high curvature, small but consistent gradients, or noisy gradients. It introduces a velocity term that accumulates the gradient of the loss function over time, thereby smoothing the path taken by the parameters.
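The velocity term can be sketched as follows. This uses one common formulation, velocity = beta * velocity + gradient, and other variants exist, for example scaling the gradient by (1 - beta). The function name `momentum_gradient_descent`, the test function and the values of `lr` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

def momentum_gradient_descent(grad_fn, theta0, lr=0.01, beta=0.9, steps=300):
    """Gradient descent with a velocity term that accumulates past gradients."""
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(theta)  # decayed running sum of gradients
        theta -= lr * velocity                       # step along the accumulated direction
    return theta

# Illustrative ill-conditioned quadratic: f(theta) = 0.5 * (theta[0]**2 + 50 * theta[1]**2).
# Its gradient is small but consistent along theta[0] and steep along theta[1].
scales = np.array([1.0, 50.0])

def grad_fn(theta):
    return scales * theta

with_momentum = momentum_gradient_descent(grad_fn, [1.0, 1.0], beta=0.9)
without_momentum = momentum_gradient_descent(grad_fn, [1.0, 1.0], beta=0.0)  # plain gradient descent

print(np.linalg.norm(with_momentum), np.linalg.norm(without_momentum))
```

With `beta=0.0` the velocity equals the current gradient, so the same function reduces to standard gradient descent. On this example the momentum run ends much closer to the minimum at the origin after the same number of steps, because the velocity builds up along the flat direction.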