Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning to minimize a model's loss function by iteratively updating its parameters. Instead of computing gradients over the entire dataset, SGD updates the model weights using a single randomly selected data point or a small batch, which makes each update cheaper and lets training scale to large datasets.
Updates parameters using one sample (or a small batch), reducing computation required in each iteration and making it suitable for large datasets.
Frequent parameter updates allow the model to learn faster and explore the parameter space more effectively.
Commonly used in training neural networks where weights and biases are updated iteratively to minimize the loss function.
Introduces randomness during training, which can help the model escape local minima and improve generalization.
Figure: Path followed by batch gradient descent vs. path followed by SGD
In the figure above, batch gradient descent follows a smooth and direct path toward the minimum, while SGD takes a noisier, zig-zag path because each update is based on an individual data sample.
Working
Stochastic Gradient Descent (SGD) minimizes a loss function by iteratively updating model parameters using the gradient computed from a randomly selected training example. SGD performs updates after evaluating a single data point or a small mini-batch.
1. Objective Function
The goal of SGD is to minimize the loss function, which measures how far the model's prediction is from the actual value. The objective function is defined as:
$$J(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$$
where
$n$: total number of training samples
$y_i$: actual value
$\hat{y}_i$: predicted value
$L$: loss function
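As a small illustration of the objective, the sketch below averages a per-sample loss over a handful of points. The squared-error loss and the toy values are assumptions chosen for illustration; the text above does not fix a particular loss.

```r
# Hypothetical toy values, used only for illustration
y     <- c(3.0, 5.0, 7.5)   # actual values y_i
y_hat <- c(2.5, 5.5, 7.0)   # predicted values y_hat_i

# Per-sample loss L(y_i, y_hat_i): squared error (an assumed choice)
squared_error <- function(y, y_hat) (y - y_hat)^2

# Objective: average of the per-sample losses over the n samples
objective <- function(y, y_hat) mean(squared_error(y, y_hat))

J <- objective(y, y_hat)
print(J)
```

Each prediction here is off by 0.5, so every per-sample loss is 0.25 and so is their average.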
2. Gradient Calculation
To reduce the loss, we calculate the gradient, which tells us in which direction and by how much the parameters should change.
$$\nabla_w L = \frac{\partial L}{\partial w}$$
where
$\partial$ represents the partial derivative.
$w$ represents the model weight (parameter).
$\frac{\partial L}{\partial w}$ represents how the loss changes with respect to the weight.
3. Parameter Update Rule
Once the gradient is calculated, the model parameters are updated using the SGD update rule:
$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$
where $\eta$ is the learning rate.
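To make the update rule concrete, here is a minimal sketch of a single SGD step for a one-feature linear model. It assumes the prediction is w * x + b and the loss is the squared error; the function name sgd_step and the sample values are hypothetical.

```r
# One SGD step for the model y_hat = w * x + b with loss (y_hat - y)^2.
# sgd_step is a hypothetical helper written for this illustration.
sgd_step <- function(w, b, x_i, y_i, eta) {
  y_hat  <- w * x_i + b          # prediction for the selected sample
  error  <- y_hat - y_i          # signed prediction error
  grad_w <- 2 * error * x_i      # dL/dw for the squared-error loss
  grad_b <- 2 * error            # dL/db for the squared-error loss
  list(w = w - eta * grad_w,     # move against the gradient
       b = b - eta * grad_b)
}

# Start at w = 0, b = 0 and take one step on the sample (x = 2, y = 10)
step <- sgd_step(w = 0, b = 0, x_i = 2, y_i = 10, eta = 0.01)
print(step)
```

With these values the error is -10, so the gradients are -40 and -20, and one step moves the parameters to w = 0.4 and b = 0.2.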
4. SGD Training Algorithm
The SGD algorithm repeats the following steps until the model converges:
Initialize the model parameters (weights and bias) with random values and choose a learning rate $\eta$.
Randomly shuffle the training dataset so that the order of the examples does not bias the updates.
Select one training example and compute the predicted output $\hat{y}$.
Calculate the loss and compute the gradient of the loss with respect to the model parameters.
Update the parameters using the SGD update rule, then repeat for the remaining examples and over multiple epochs until the model converges, meaning the loss becomes stable or sufficiently small.
Step By Step Implementation
Here we implement the SGD algorithm in R.
Step 1: Generate and Visualize Synthetic Dataset
Here we create a synthetic dataset with a linear relationship between x and y and add random noise. A scatter plot is used to visualize the data distribution.
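A minimal sketch of this step might look as follows. The true slope (3), intercept (5), noise level, sample size and seed are all assumptions made for illustration, not values taken from the original code.

```r
set.seed(42)                       # assumed seed, for reproducibility

n <- 100                           # assumed number of samples
x <- runif(n, min = 0, max = 10)   # feature drawn uniformly from [0, 10]
y <- 3 * x + 5 + rnorm(n, mean = 0, sd = 2)   # linear relation plus noise

# Scatter plot of the generated data
plot(x, y,
     main = "Synthetic Dataset",
     xlab = "x", ylab = "y",
     pch = 19, col = "steelblue")
```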
Step 2: Initialize Model Parameters and Training Settings
We randomly initialize the model weight and bias, set the learning rate, define the number of training epochs and create a vector to store the loss at each epoch.
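The initialization could be sketched as below. The variable names, the seed and the specific settings are assumptions. The learning rate is set to 0.001 here because the assumed feature range of [0, 10] is not rescaled, and a larger step can become unstable for per-sample updates.

```r
set.seed(123)                      # assumed seed

w <- rnorm(1)                      # random initial weight
b <- rnorm(1)                      # random initial bias

learning_rate <- 0.001             # assumed step size
epochs        <- 100               # assumed number of passes over the data

# One slot per epoch to record the training loss
loss_history <- numeric(epochs)
```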
Step 3: Train Model Using Stochastic Gradient Descent
Here we train the linear regression model by updating the weight and bias for each training example. The dataset is shuffled every epoch and the loss is recorded to track training progress.
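The training loop might be sketched as follows. To keep the block runnable on its own, it regenerates an assumed dataset (y = 3x + 5 plus noise) and re-initializes the parameters; the squared-error loss and all settings are assumptions, so the printed values will not necessarily match the output reported in this article.

```r
set.seed(42)
n <- 100
x <- runif(n, 0, 10)                   # assumed synthetic feature
y <- 3 * x + 5 + rnorm(n, sd = 2)      # assumed linear relation plus noise

w <- rnorm(1)                          # random initial weight
b <- rnorm(1)                          # random initial bias
learning_rate <- 0.001
epochs <- 100
loss_history <- numeric(epochs)

for (epoch in seq_len(epochs)) {
  idx <- sample(n)                     # shuffle the sample order every epoch
  for (i in idx) {
    y_hat <- w * x[i] + b              # prediction for one sample
    error <- y_hat - y[i]
    # Gradient step for the squared-error loss (y_hat - y)^2
    w <- w - learning_rate * 2 * error * x[i]
    b <- b - learning_rate * 2 * error
  }
  # Record the mean squared error over the full dataset after this epoch
  loss_history[epoch] <- mean((y - (w * x + b))^2)
}

cat("Final Weight:", w, "\n")
cat("Final Bias:", b, "\n")
```

Plotting loss_history against the epoch number is a simple way to check that the loss is levelling off.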
Step 4: Display Final Model Parameters
After training, we print the final weight and bias learned by the model to see the values that best fit the dataset.
Output:
Final Weight: 3.3142
Final Bias: 4.718445
Step 5: Visualize Regression Line
Plot the training data points and draw the regression line using the final weight and bias learned by SGD to see how well the model fits the data.
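A sketch of the plot is shown below. The helper plot_fit is hypothetical, the data are regenerated under the same assumed setup (y = 3x + 5 plus noise), and the weight and bias are the final values reported in the output above. Because the data here are an assumption, the line may not fit them as closely as it fit the original dataset.

```r
set.seed(42)
x <- runif(100, 0, 10)                 # assumed synthetic feature
y <- 3 * x + 5 + rnorm(100, sd = 2)    # assumed linear relation plus noise

# Hypothetical helper: scatter plot plus the fitted line y = w * x + b
plot_fit <- function(x, y, w, b) {
  plot(x, y,
       main = "SGD Regression Line",
       xlab = "x", ylab = "y",
       pch = 19, col = "steelblue")
  abline(a = b, b = w, col = "red", lwd = 2)   # a = intercept, b = slope
  invisible(w * x + b)                 # return the fitted values quietly
}

# Final weight and bias as reported in the output above
fitted_values <- plot_fit(x, y, w = 3.3142, b = 4.718445)
```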
Hyperparameters in SGD
The performance and convergence of Stochastic Gradient Descent (SGD) depend on the choice of its hyperparameters. Choosing suitable values is critical for efficient training and good model performance.
1. Learning Rate
The learning rate ($\eta$) controls the step size the algorithm takes towards minimizing the loss function. It is one of the most sensitive hyperparameters in SGD.
Too High: The algorithm may overshoot the minimum or diverge entirely.
Too Low: Convergence becomes very slow and the model may get stuck in local minima.
Strategy: Start with a small value such as 0.01. You can also use learning rate schedules or adaptive learning rate methods (like Adam or RMSProp) to adjust it dynamically during training.
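The effect of the learning rate can be explored with a small experiment like the one below. The helper train_sgd, the synthetic data and the three learning rates are assumptions chosen for illustration. Because the feature is left unscaled in [0, 10], even 0.1 is far too large here, which is why the "moderate" value in this sketch is smaller than 0.01.

```r
# Hypothetical helper: per-sample SGD for y_hat = w * x + b with squared error.
# Returns the mean squared error on the training data after training.
train_sgd <- function(x, y, learning_rate, epochs = 20) {
  w <- 0
  b <- 0
  for (epoch in seq_len(epochs)) {
    for (i in sample(length(x))) {
      error <- (w * x[i] + b) - y[i]
      w <- w - learning_rate * 2 * error * x[i]
      b <- b - learning_rate * 2 * error
    }
  }
  mean((y - (w * x + b))^2)
}

set.seed(42)
x <- runif(100, 0, 10)                 # assumed synthetic feature
y <- 3 * x + 5 + rnorm(100, sd = 2)    # assumed linear relation plus noise

loss_low  <- train_sgd(x, y, learning_rate = 1e-5)  # too low: slow progress
loss_mid  <- train_sgd(x, y, learning_rate = 1e-3)  # moderate
loss_high <- train_sgd(x, y, learning_rate = 0.1)   # too high: unstable

cat("Loss (too low): ", loss_low,  "\n")
cat("Loss (moderate):", loss_mid,  "\n")
cat("Loss (too high):", loss_high, "\n")
```

Under these assumptions one would expect the smallest rate to leave a noticeably higher loss after the same number of epochs, and the largest rate to blow up to a huge or non-finite loss.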
2. Batch Size
The batch size determines how many data points are used to compute the gradient in each update step.
Stochastic (Batch Size = 1): Each update uses a single data point, introducing more noise but enabling faster iterations.
Mini-Batch: Balances computational efficiency and noise, making it the most commonly used approach in practice.
Full Batch: Uses the entire dataset for each update. This reduces gradient noise but increases computation time per iteration.
Strategy: Mini-batches are preferred for large datasets. Using smaller batches can also act as a regularizer due to the inherent noise in gradient estimation.
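A mini-batch variant can be sketched by averaging the gradient over a few samples per update. The helper minibatch_sgd, the data and all settings below are assumptions for illustration. Setting batch_size to 1 gives per-sample SGD, and setting it to the number of samples gives full-batch gradient descent.

```r
# Hypothetical helper: mini-batch SGD for y_hat = w * x + b with squared error
minibatch_sgd <- function(x, y, batch_size, learning_rate, epochs) {
  n <- length(x)
  w <- 0
  b <- 0
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)                                   # shuffle each epoch
    # Split the shuffled indices into consecutive chunks of batch_size
    batches <- split(idx, ceiling(seq_along(idx) / batch_size))
    for (batch in batches) {
      error <- (w * x[batch] + b) - y[batch]
      # Average the per-sample gradients within the batch
      w <- w - learning_rate * mean(2 * error * x[batch])
      b <- b - learning_rate * mean(2 * error)
    }
  }
  list(w = w, b = b)
}

set.seed(42)
x <- runif(100, 0, 10)                 # assumed synthetic feature
y <- 3 * x + 5 + rnorm(100, sd = 2)    # assumed linear relation plus noise

fit <- minibatch_sgd(x, y, batch_size = 10, learning_rate = 0.005, epochs = 200)
cat("Weight:", fit$w, " Bias:", fit$b, "\n")
```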
3. Epochs
An epoch is a complete pass through the entire dataset. The number of epochs determines how many times the model sees the data during training.
Too Few Epochs: The model may underfit, failing to capture patterns in the data.
Too Many Epochs: The model may overfit, learning noise rather than meaningful patterns.
Strategy: Start with 100 to 500 epochs and monitor the loss or validation performance. Early stopping can be applied to halt training when performance stops improving.
Let's consider a dataset on which we want to tune the hyperparameters of Stochastic Gradient Descent in R.
Step 1: Load Necessary Libraries and Generate Synthetic Data
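Early stopping can be sketched as follows: hold out a validation set, track the best validation loss, and stop once it has not improved for a fixed number of epochs. The split sizes, the patience value and the other settings are assumptions for illustration.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)                   # assumed synthetic feature
y <- 3 * x + 5 + rnorm(n, sd = 2)      # assumed linear relation plus noise

val_idx   <- sample(n, size = 40)              # hold out 20% for validation
train_idx <- setdiff(seq_len(n), val_idx)

w <- 0
b <- 0
learning_rate <- 0.001
max_epochs <- 500
patience   <- 10       # epochs to wait for a new best validation loss

best_val <- Inf
best_w <- w
best_b <- b
epochs_without_improvement <- 0
stopped_epoch <- max_epochs

for (epoch in seq_len(max_epochs)) {
  for (i in sample(train_idx)) {               # shuffled pass over training set
    error <- (w * x[i] + b) - y[i]
    w <- w - learning_rate * 2 * error * x[i]
    b <- b - learning_rate * 2 * error
  }
  val_loss <- mean((y[val_idx] - (w * x[val_idx] + b))^2)
  if (val_loss < best_val) {                   # new best: remember parameters
    best_val <- val_loss
    best_w <- w
    best_b <- b
    epochs_without_improvement <- 0
  } else {
    epochs_without_improvement <- epochs_without_improvement + 1
  }
  if (epochs_without_improvement >= patience) {
    stopped_epoch <- epoch
    break
  }
}

cat("Stopped at epoch:", stopped_epoch, "\n")
cat("Best validation loss:", best_val, "\n")
```

Keeping the parameters from the best validation epoch (best_w, best_b) rather than the last epoch is a common design choice, since the last few epochs before stopping did not improve on it.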
This step installs and loads the required library (ggplot2) and generates a synthetic 2D dataset for logistic regression.
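A minimal sketch of such an experiment is given below. The two Gaussian clusters, the hyperparameter values and the variable names are assumptions for illustration. The sketch does not install anything: it plots with ggplot2 only if the package is already available (install.packages("ggplot2") would add it).

```r
set.seed(1)
n <- 200
# Two assumed Gaussian clusters: class 0 around (-1, -1), class 1 around (1, 1)
x1 <- c(rnorm(n / 2, mean = -1), rnorm(n / 2, mean = 1))
x2 <- c(rnorm(n / 2, mean = -1), rnorm(n / 2, mean = 1))
label <- rep(c(0, 1), each = n / 2)
X <- cbind(x1, x2)

sigmoid <- function(z) 1 / (1 + exp(-z))

w <- c(0, 0)              # one weight per feature
b <- 0
learning_rate <- 0.01     # assumed hyperparameters
epochs <- 100

for (epoch in seq_len(epochs)) {
  for (i in sample(n)) {                       # shuffled per-sample updates
    prob <- sigmoid(sum(w * X[i, ]) + b)       # predicted P(label = 1)
    err  <- prob - label[i]                    # gradient of log loss w.r.t. score
    w <- w - learning_rate * err * X[i, ]
    b <- b - learning_rate * err
  }
}

cat("Weights:", w, "\n")
cat("Bias:", b, "\n")

pred <- as.numeric(sigmoid(X %*% w + b) > 0.5)
accuracy <- mean(pred == label)
cat("Training accuracy:", accuracy, "\n")

# Decision boundary: w[1] * x1 + w[2] * x2 + b = 0
if (requireNamespace("ggplot2", quietly = TRUE)) {
  df <- data.frame(x1 = x1, x2 = x2, label = factor(label))
  boundary_plot <- ggplot2::ggplot(df, ggplot2::aes(x = x1, y = x2, color = label)) +
    ggplot2::geom_point() +
    ggplot2::geom_abline(intercept = -b / w[2], slope = -w[1] / w[2]) +
    ggplot2::labs(title = "Logistic Regression Trained with SGD")
  print(boundary_plot)
}
```

Changing learning_rate and epochs and re-running is a quick way to see how the hyperparameters discussed above affect the learned boundary.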
The output shows the optimized weights and bias of the logistic regression model, and the plot visualizes the decision boundary separating the two classes.
Advantages
Updates parameters using one sample at a time, making each iteration faster than full batch gradient descent.
Frequent updates help the model explore the parameter space more effectively, often leading to faster convergence.
Requires less memory since it processes small batches or single samples, making it suitable for large datasets.
The randomness in updates helps escape shallow local minima and find better solutions.
Widely used in deep learning due to its efficiency and scalability.
Limitations
Random updates introduce noise, causing fluctuations in the loss and less stable convergence.
Highly sensitive to learning rate; improper values can lead to slow training or divergence.
May take more epochs to reach the optimal solution due to noisy updates.
Can perform poorly on sparse or imbalanced datasets, where a single sample may not represent the overall pattern.
Without enhancements like momentum or adaptive methods, it may struggle with complex loss surfaces.