Introduction To Adam Optimizer

Last Updated : 19 May, 2026

Adam (Adaptive Moment Estimation) combines the advantages of Momentum and RMSprop to adapt the learning rate of each parameter during training. It works well with large datasets and complex models because its extra memory cost is modest (two moving averages per parameter) and the per-parameter step sizes are adjusted automatically.

Working of Adam Optimizer

Adam combines two optimization techniques, Momentum and RMSprop, to achieve faster and more stable training.

1. Momentum

Momentum accelerates gradient descent by using a moving average of past gradients, helping reduce oscillations and speed up convergence. The update rule with momentum is:

$$w_{t+1} = w_t - \alpha\, m_t$$

where:

$m_t$ is the moving average of the gradients at time $t$
$\alpha$ is the learning rate
$w_t$ and $w_{t+1}$ are the weights at time $t$ and $t+1$, respectively

The momentum term is updated recursively as:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \frac{\partial L}{\partial w_t}$$

where:

$\beta_1$ is the momentum parameter (typically set to 0.9)
$\frac{\partial L}{\partial w_t}$ is the gradient of the loss function with respect to the weights at time $t$
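The momentum update described above can be sketched in plain Python. This is only an illustration: the function name `momentum_minimize`, the toy loss $(w - 3)^2$, and the step count are assumptions chosen for the example, not part of the original article.

```python
def momentum_minimize(grad_fn, w0, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent driven by a moving average of past gradients."""
    w, m = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)                    # gradient of the loss at the current weight
        m = beta * m + (1.0 - beta) * g   # moving average of the gradients
        w = w - alpha * m                 # step along the averaged gradient
    return w

# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = momentum_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Because the step uses the averaged gradient rather than the raw one, a single noisy or sign-flipping gradient changes the direction only gradually.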

2. RMSprop (Root Mean Square Propagation)

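RMSprop is an adaptive learning rate optimization method that improves AdaGrad by using an exponentially weighted moving average of squared gradients. This prevents the learning rate from decreasing too quickly during training. The update rule for RMSprop is:

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, \frac{\partial L}{\partial w_t}$$

where:

$v_t$ is the exponentially weighted average of squared gradients, with decay rate $\beta_2$:

$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2) \left(\frac{\partial L}{\partial w_t}\right)^2$$

$\epsilon$ is a small constant (e.g., $10^{-8}$) added to prevent division by zero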
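A minimal sketch of the RMSprop update in plain Python follows. The function name `rmsprop_minimize`, the decay value `beta=0.9`, and the toy loss are assumptions made for illustration.

```python
import math

def rmsprop_minimize(grad_fn, w0, alpha=0.01, beta=0.9, eps=1e-8, steps=1000):
    """Scale each step by a running root-mean-square of recent gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1.0 - beta) * g * g        # moving average of squared gradients
        w = w - alpha * g / (math.sqrt(v) + eps)   # larger recent gradients give smaller steps
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3)
w_final = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

With a fixed learning rate the iterate ends up close to the minimum rather than exactly on it, since the normalized step does not shrink to zero on its own.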

Combining Momentum and RMSprop to form Adam Optimizer

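Adam combines the momentum and RMSprop techniques to provide a more balanced and efficient optimization process. The key equations governing Adam are as follows:

First moment (mean) estimate:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \frac{\partial L}{\partial w_t}$$

Second moment (uncentered variance) estimate:

$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2) \left(\frac{\partial L}{\partial w_t}\right)^2$$

Bias correction: Since both $m_t$ and $v_t$ are initialized at zero, they tend to be biased toward zero, especially during the initial steps. To correct this bias, Adam computes the bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Final weight update: The weights are then updated as:

$$w_{t+1} = w_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$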
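The four stages (first moment, second moment, bias correction, weight update) can be put together in a short plain-Python sketch. The function name `adam_minimize` and the toy loss are assumptions for illustration, and the call uses a learning rate of 0.1 instead of the usual 0.001 so that the example converges in a small number of steps.

```python
import math

def adam_minimize(grad_fn, w0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Scalar Adam: momentum on the gradient plus RMS scaling, with bias correction."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):          # t starts at 1 so the corrections are defined
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g          # first moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # undo the bias toward zero
        v_hat = v / (1.0 - beta2 ** t)
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3); alpha=0.1 is chosen for the demo only
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, alpha=0.1)
print(w_final)
```

In a real network the same arithmetic is applied element-wise, with one `m` and one `v` entry per weight.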

Key Parameters

$\alpha$: The learning rate or step size (default is 0.001)
$\beta_1$ and $\beta_2$: Decay rates for the moving averages of the gradient and squared gradient, typically set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$
$\epsilon$: A small positive constant (e.g., $10^{-8}$) used to avoid division by zero when computing the final update
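As a worked check of how these parameters interact, consider the very first step ($t = 1$) with both moving averages initialized at zero. The shorthand $g_1$ for the first gradient is introduced here only for brevity.

```latex
\begin{align*}
m_1 &= (1 - \beta_1)\, g_1, & \hat{m}_1 &= \frac{m_1}{1 - \beta_1^{1}} = g_1 \\
v_1 &= (1 - \beta_2)\, g_1^2, & \hat{v}_1 &= \frac{v_1}{1 - \beta_2^{1}} = g_1^2 \\
w_2 &= w_1 - \alpha\, \frac{g_1}{|g_1| + \epsilon} \approx w_1 - \alpha\, \operatorname{sign}(g_1)
\end{align*}
```

So the first update has size of about $\alpha$ whatever the gradient's scale. Without bias correction the ratio would be $(1 - \beta_1)/\sqrt{1 - \beta_2} = 0.1/\sqrt{0.001} \approx 3.2$, giving a first step of roughly $3.2\,\alpha$ with the typical values.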

Performance of Adam Optimizer

Adam performs strongly when training deep learning models on large datasets because it combines adaptive learning rates with momentum.

Uses adaptive learning rates for each parameter based on past gradients and their magnitudes
Helps reduce oscillations and can carry the optimizer past shallow local minima
Applies bias correction to prevent instability during early training stages
Requires less hyperparameter tuning compared to optimizers like SGD
Provides efficient, stable, and reliable optimization across different tasks

Figure: Performance Comparison on Training cost

Implementation

Step 1: Install Required Libraries

TensorFlow/Keras is used for building and training neural networks
NumPy handles numerical computations and arrays
Matplotlib is used for visualization and plotting
Scikit-learn provides dataset utilities and preprocessing tools
Run the following command in your terminal

pip install tensorflow numpy matplotlib scikit-learn

Step 2: Import Required Libraries

make_moons() generates a non-linear classification dataset
train_test_split() divides data into training and testing sets
StandardScaler() normalizes the data
Sequential() creates a neural network model
Dense() adds fully connected layers
Adam() applies the Adam optimizer
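The original import listing is not preserved here. A sketch consistent with the functions named above, assuming the `tensorflow.keras` import paths shipped with TensorFlow 2, might look like this:

```python
# Dataset generation, splitting, and scaling (scikit-learn)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model container, layer type, and optimizer (Keras, bundled with TensorFlow)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
```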

Step 3: Create Dataset

Generate a synthetic dataset for binary classification.

n_samples=1000 creates 1000 data points
noise=0.2 adds slight randomness for realistic data
random_state=42 ensures reproducibility
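With the parameters listed above, the dataset can be generated as in the following sketch. The variable names `X` and `y` are a common convention and are assumed here.

```python
from sklearn.datasets import make_moons

# Two interleaving half-circles: a standard non-linear binary classification toy set
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

print(X.shape, y.shape)  # (1000, 2) (1000,)
```

Each row of `X` holds two features, and `y` holds the class label (0 or 1) of the corresponding point.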

Step 4: Split the Dataset

Divide the dataset into training and testing sets.

80% of data is used for training
20% of data is reserved for testing
Helps evaluate model generalization
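A minimal sketch of the 80/20 split, regenerating the data so the snippet runs on its own. The seed `random_state=42` for the split is an assumption; the source only fixes the proportions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)  # same data as Step 3

# test_size=0.2 holds out 20% of the samples; the split seed is an assumed value
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 2) (200, 2)
```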

Step 5: Normalize the Data

Feature scaling improves optimization performance and convergence speed.

Standardization transforms features to zero mean and unit variance
Prevents features with large values from dominating training
Helps Adam optimizer converge faster and more stably
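The scaling step might look like the sketch below. It regenerates and splits the data so that it runs on its own, and it fits the scaler on the training set only, a common practice assumed here to keep test-set statistics out of training.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # split seed is an assumed value
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from the training data
X_test = scaler.transform(X_test)        # reuse the training statistics on the test data

print(X_train.mean(axis=0), X_train.std(axis=0))
```

After the transform, each training feature has mean approximately 0 and standard deviation approximately 1; the test features are close to that but not exact, because they are scaled with the training statistics.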

Step 6: Build the Neural Network

Create a neural network using fully connected layers.

First hidden layer contains 16 neurons with ReLU activation
Second hidden layer contains 8 neurons
Output layer uses sigmoid activation for binary classification
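One way to express this architecture in Keras is sketched below. The ReLU activation on the second hidden layer and the explicit `Input` layer are assumptions, since the source specifies only the layer sizes, the first activation, and the sigmoid output.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(2,)),               # two input features, as produced by make_moons
    Dense(16, activation="relu"),    # first hidden layer
    Dense(8, activation="relu"),     # second hidden layer (activation assumed)
    Dense(1, activation="sigmoid"),  # probability of the positive class
])

model.summary()
```

The parameter count is small: 2×16 + 16 = 48 for the first layer, 16×8 + 8 = 136 for the second, and 8 + 1 = 9 for the output, 193 in total.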

Step 7: Compile the Model with Adam Optimizer

Initialize the Adam optimizer with a learning rate.
Compile the neural network with loss function and evaluation metric.
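A sketch of the compile step follows. The learning rate of 0.001, the `binary_crossentropy` loss, and the `accuracy` metric are assumptions that match the defaults and the binary task described in this article; the model is rebuilt inline so the snippet runs on its own.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Same architecture as described in Step 6 (second-layer ReLU is an assumption)
model = Sequential([
    Input(shape=(2,)),
    Dense(16, activation="relu"),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])

# beta_1 and beta_2 are the Keras names for the two decay rates
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model.compile(
    optimizer=optimizer,
    loss="binary_crossentropy",  # assumed: standard loss for a sigmoid output
    metrics=["accuracy"],
)
```

Note that Keras uses `epsilon=1e-7` by default rather than the $10^{-8}$ quoted in the Key Parameters section; it can be overridden through the `epsilon` argument.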

Step 8: Train the Model

Train the neural network using the training dataset.

epochs=50 means the dataset is processed 50 times
batch_size=32 updates weights after every 32 samples
validation_split=0.2 reserves 20% of training data for validation
Training history stores loss and accuracy values

Output:

Figure: Training the model

Step 9: Evaluate the Model

Evaluate model performance on unseen test data.

Evaluates how well the model generalizes to new data
Lower loss and higher accuracy indicate better performance
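Since training and evaluation depend on every earlier step, the sketch below strings Steps 3 to 9 together into one runnable script. It is not the article's original listing: the split seed, the second-layer ReLU, the loss, the metric, and the learning rate are assumptions, and the exact numbers printed will vary from run to run.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Steps 3 to 5: data, 80/20 split, scaling fitted on the training set only
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 6 and 7: model and Adam optimizer
model = Sequential([
    Input(shape=(2,)),
    Dense(16, activation="relu"),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Step 8: train; the returned History object stores per-epoch loss and accuracy
history = model.fit(
    X_train, y_train,
    epochs=50, batch_size=32, validation_split=0.2, verbose=0,
)

# Step 9: evaluate on the held-out test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {test_loss:.4f}  Test accuracy: {test_accuracy:.4f}")
```

The per-epoch curves are available as `history.history["loss"]` and `history.history["val_loss"]` (and the matching accuracy keys), which is what a training plot would be drawn from.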

Output:

Figure: Evaluation


Advantages

Uses adaptive learning rates for each parameter based on past gradients
Helps reduce oscillations and can carry the optimizer past shallow local minima
Applies bias correction to improve stability during early training stages
Requires less hyperparameter tuning compared to optimizers like SGD
Provides efficient optimization across different machine learning tasks

Limitations

Can sometimes converge to suboptimal solutions compared to SGD
Requires more memory because it stores additional moment estimates
Performance is sensitive to hyperparameter selection in some cases
May generalize less effectively on certain datasets and deep learning tasks
Can struggle with sparse gradients or very noisy optimization landscapes


URL: https://www.geeksforgeeks.org/deep-learning/adam-optimizer/