Distributed Training with TensorFlow

Last Updated : 23 Jul, 2025

As datasets grow and models become more complex, training on a single device often cannot keep up with the demands of modern workloads. This is what makes distributed training necessary. In simple terms, distributed training splits the computational workload across multiple devices or machines so that a machine learning model trains faster and more efficiently.

In this article, we will discuss distributed training with TensorFlow and see how you can incorporate it into your AI workflows. We will also cover best practices and tips for getting the most out of TensorFlow's distributed training capabilities.

What is Distributed Training?

Distributed training is a machine learning technique in which the computational work of training a model is split across multiple devices that run at the same time, each contributing to the overall training.

In machine learning, data is the key to building a successful model. The more quality data you have, the better your model can train. However, as your dataset grows, your model's complexity and the amount of computation grow with it, which makes training a time-consuming process. One of the main reasons to use distributed training is therefore to speed up the training of large-scale models.

There are two main approaches to distributed training:

Data Parallelism: In data parallelism, the training data is split across the devices available for computation. A copy of the model is trained on each device using a different portion of the data. The copies are synchronized regularly so that all of them hold the same weights. This method works best when the dataset is large and the model fits in the memory of a single device.
Model Parallelism: In model parallelism, we split the model itself rather than the data. Different parts of the model run on different devices or machines. For example, if a model first calculates products and then sums them, one device computes the products while another device adds them up. Model parallelism is especially useful when the model is too large to fit in the memory of a single machine. It is more complex and less common than data parallelism, but it is still used in some specialized applications.
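To make the two approaches concrete, here is a minimal single-process sketch in plain Python. It only simulates devices with list slices and does not use TensorFlow; the toy model y = w * x, the helper names, and the data are all assumptions made for illustration.

```python
# Toy simulation of both approaches; "devices" are just list slices here.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_devices, lr=0.01):
    """One synchronous data-parallel update.

    Assumes len(xs) is divisible by num_devices so every shard has equal size.
    """
    shard = len(xs) // num_devices
    grads = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        # Each simulated device holds the same w but sees a different shard.
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Synchronization: average the per-device gradients, then update once.
    avg_grad = sum(grads) / num_devices
    return w - lr * avg_grad


def model_parallel_forward(weights, inputs):
    """Product-then-sum model split into two stages, as in the example above."""
    products = [w * x for w, x in zip(weights, inputs)]  # stage on "device 0"
    return sum(products)                                 # stage on "device 1"


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_parallel = data_parallel_step(0.0, xs, ys, num_devices=2)
w_single = 0.0 - 0.01 * grad_mse(0.0, xs, ys)
print(w_parallel, w_single)

print(model_parallel_forward([1.0, 2.0], [3.0, 4.0]))
```

With equal-sized shards, the averaged per-device gradient equals the full-batch gradient, so the two printed weights match. This is why synchronized data parallelism behaves like training on one device with a larger batch.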

Distributed Training with TensorFlow

TensorFlow allows the training phase to be split over multiple machines and devices. The main goal of distributed training is to parallelize computations, which can drastically cut the time required to train a model. Spreading the work over several devices also makes better use of the available resources. In addition, this approach scales: as the data grows, it can be split between more devices for processing. TensorFlow provides several techniques for dividing the computational load among distributed computing resources.

Distributed Strategy in TensorFlow

In TensorFlow, a distribution strategy (the tf.distribute.Strategy API) is the abstraction that controls how training is spread across devices or machines, so the same model code can run on one device or on many. The two most widely adopted distribution strategies are:

MirroredStrategy: It uses the data parallelism technique. The model is first replicated on each device; during training, every replica computes gradients on its share of the data at the same time, and the gradients are synchronized across devices.
ParameterServerStrategy: It uses a parameter server architecture. Here, work is divided between parameter servers and worker devices. Worker devices are responsible for computation, whereas parameter servers store and update the model parameters.

TensorFlow provides these strategies, but how efficiently the work is distributed across multiple devices still depends on how we use them.
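The division of labor in a parameter server architecture can be illustrated with a small plain-Python sketch. This is not the ParameterServerStrategy API (which needs a configured cluster); the ParameterServer class, the worker_gradient helper, and the toy data are hypothetical, and the workers run one after another here rather than concurrently.

```python
class ParameterServer:
    """Holds the shared parameter and applies updates pushed by workers."""

    def __init__(self, w):
        self.w = w

    def pull(self):
        """Return the current parameter value to a worker."""
        return self.w

    def push(self, grad, lr):
        """Apply a gradient computed by a worker."""
        self.w -= lr * grad


def worker_gradient(w, xs, ys):
    """Worker-side computation: MSE gradient for y = w * x on a local shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Each worker owns one shard of the data (true relationship: y = 2 * x).
shards = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0, 4.0], [6.0, 8.0]),
]

server = ParameterServer(w=0.0)
for _ in range(50):
    for xs, ys in shards:
        w = server.pull()                    # worker fetches current value
        grad = worker_gradient(w, xs, ys)    # worker does the computation
        server.push(grad, lr=0.01)           # server stores the update

print(round(server.w, 4))
```

The workers never update the parameter themselves; they only compute gradients, while the server owns the parameter, which mirrors the roles described above.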

How does Distributed Training work in TensorFlow?

Let's see how we can use TensorFlow's distribution strategies to train a model. We will use the MNIST dataset in this example to keep things simple and easy to follow.

Step 1: Import TensorFlow and define the Model

First, we import the tensorflow library and, specifically, the layers and models modules from the Keras API. Then, we define a simple neural network model. Since we are using the MNIST dataset, we create a simple convolutional neural network (CNN) using the Sequential API. This model consists of a convolutional layer, a max-pooling layer, a flatten layer, and two dense layers.
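A model matching this description might look like the sketch below. The function name create_model, the filter count, the kernel size, and the width of the first dense layer are assumptions; only the layer types and their order come from the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def create_model():
    """Small CNN for 28x28 grayscale images with 10 output classes."""
    return models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer
        layers.MaxPooling2D((2, 2)),                   # max-pooling layer
        layers.Flatten(),                              # flatten layer
        layers.Dense(64, activation="relu"),           # first dense layer
        layers.Dense(10, activation="softmax"),        # one output per digit
    ])


model = create_model()
model.summary()
```

Wrapping the definition in a function makes it easy to build the model later inside a strategy scope.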

Step 2: Load and Preprocess the Dataset

The MNIST dataset consists of 60,000 training and 10,000 testing images of handwritten digits, ranging from 0 to 9. In this step, we reshape the images to have a single channel (since they are grayscale) and normalize the pixel values to the range [0, 1] by dividing by 255.
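The preprocessing can be sketched as follows. The helper names preprocess_images and load_train_data are assumptions; tf.keras.datasets.mnist.load_data() downloads the dataset the first time it is called.

```python
import numpy as np
import tensorflow as tf


def preprocess_images(images):
    """Add a channel axis and scale pixel values from [0, 255] to [0, 1]."""
    images = np.asarray(images, dtype="float32").reshape((-1, 28, 28, 1))
    return images / 255.0


def load_train_data():
    """Load the MNIST training split and return preprocessed images and labels."""
    (train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
    return preprocess_images(train_images), train_labels
```

Calling load_train_data() should return an image array of shape (60000, 28, 28, 1) together with the integer labels.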

Step 3: Initialize MirroredStrategy

Now, we initialize the MirroredStrategy for distributed training. This strategy implements data parallelism: it replicates the model across multiple GPUs, if available, for computation.
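A minimal sketch of this step is shown below; the variable name strategy is an assumption.

```python
import tensorflow as tf

# With no arguments, MirroredStrategy uses all GPUs visible to TensorFlow;
# on a machine without GPUs it falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)
```

Printing num_replicas_in_sync is a quick way to check how many devices will actually take part in training.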

Step 4: Wrap Model Creation and Training

We use a with statement to create the model within the scope of the MirroredStrategy. Variables created inside this block, such as the model's weights and the optimizer's state, are mirrored across the available devices, which is what lets TensorFlow distribute the training computations.

We compile the model by specifying the desired optimizer, loss function, and metrics. In this example, we use the Adam optimizer, the sparse categorical crossentropy loss function (since the labels are integers), and accuracy as the evaluation metric.
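One way to express this step is a small helper that takes the strategy and a model-building function. The name build_and_compile and its parameters are assumptions; the optimizer, loss, and metric are the ones named above.

```python
import tensorflow as tf


def build_and_compile(strategy, model_fn):
    """Create and compile a model inside the strategy's scope.

    model_fn is any zero-argument callable returning an uncompiled Keras model.
    """
    with strategy.scope():
        # Variables created here are mirrored across the strategy's devices.
        model = model_fn()
        model.compile(
            optimizer="adam",
            loss="sparse_categorical_crossentropy",  # labels are integers
            metrics=["accuracy"],
        )
    return model
```

It would be called with the strategy from Step 3 and your own function that builds the CNN from Step 1.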

Step 5: Create dataset object

Now, we create a TensorFlow Dataset object from the training images and labels. This Dataset object lets us iterate over the training data efficiently during training. Here, we shuffle the dataset and batch it with a batch size of 32.
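The sketch below shows one way to build such a dataset. The helper name make_train_dataset and the shuffle buffer size are assumptions, and the zero-filled arrays are stand-ins with MNIST's shapes so the snippet runs without a download.

```python
import numpy as np
import tensorflow as tf


def make_train_dataset(images, labels, batch_size=32, shuffle_buffer=60000):
    """Shuffle the examples and group them into batches."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    return ds.shuffle(shuffle_buffer).batch(batch_size)


# Stand-in data (assumption); replace with the preprocessed MNIST arrays.
images = np.zeros((100, 28, 28, 1), dtype="float32")
labels = np.zeros((100,), dtype="int64")

train_dataset = make_train_dataset(images, labels)
print(train_dataset.element_spec)
```

When such a dataset is passed to fit() under MirroredStrategy, the batch size acts as the global batch size, and each batch is split across the replicas.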

Step 6: Train the Model

We use the fit() method to train the model for 5 epochs, passing the dataset created in the previous step. While the model trains, TensorFlow distributes the computation across the available devices using MirroredStrategy, and the gradient updates are synchronized across devices.
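A self-contained sketch of the training call is shown below. The random data and the deliberately tiny model are stand-ins (assumptions) so that the snippet runs on its own; in the walkthrough they would be the MNIST dataset and the CNN from the earlier steps.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Stand-in data with MNIST's shapes (assumption, not the real dataset).
images = np.random.rand(64, 28, 28, 1).astype("float32")
labels = np.random.randint(0, 10, size=(64,))
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

with strategy.scope():
    # Tiny stand-in model (assumption); the walkthrough uses a CNN instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

# fit() splits each global batch across the replicas and aggregates gradients.
history = model.fit(train_dataset, epochs=5)
print(history.history["loss"])
```

Because the stand-in data is random, the metrics printed by this snippet are not meaningful; it only demonstrates the call pattern.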

Output:

Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 19s 7ms/step - accuracy: 0.7512 - loss: 0.8273
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9477 - loss: 0.1705
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9628 - loss: 0.1198
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9712 - loss: 0.0912
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9787 - loss: 0.0701
<keras.src.callbacks.history.History at 0x7a5cfd8a68c0>

Step 7: Test the model

We load the test split of the dataset and preprocess it the same way we did for training. Finally, we use the evaluate() method to evaluate the model on the test data.
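The evaluation step can be sketched as follows. The helper names load_test_data and evaluate_on_test_set are assumptions, and the model argument is expected to be compiled with accuracy as its only metric, as in Step 4.

```python
import numpy as np
import tensorflow as tf


def load_test_data():
    """Load the raw MNIST test split (downloads the data on first use)."""
    _, (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
    return test_images, test_labels


def evaluate_on_test_set(model, test_images, test_labels):
    """Preprocess raw test images like the training set, then evaluate."""
    test_images = np.asarray(test_images, dtype="float32")
    test_images = test_images.reshape((-1, 28, 28, 1)) / 255.0
    test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
    print("Test accuracy:", test_acc)
    return test_loss, test_acc
```

With the trained model, the call would be evaluate_on_test_set(model, *load_test_data()).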

Output:

313/313 - 2s - 5ms/step - accuracy: 0.9846 - loss: 0.0455
Test accuracy: 0.9846000075340271

Optimizing Distributed Training: Best Practices & Fault Tolerance

Optimizing Performance in Distributed Training

You can optimize the performance of distributed training by following the best practices below:

Cut Data Transfer Overhead: Reduce data transfer overhead by preprocessing the data ahead of time and loading it into memory as efficiently as possible before training.
Select the Optimum Distributed Strategy: Choose the distribution strategy that suits your model architecture and the available resources. You may test both model and data parallelism to determine which works better.
Reduce Communication Overhead: Reduce communication overhead by combining many small communication operations into fewer, larger ones and by optimizing the network configuration.
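As an illustration of the first point, the sketch below builds a tf.data input pipeline that caches the preprocessed data and prefetches batches. The helper name make_input_pipeline, its parameters, and the choice to scale the global batch size with the number of replicas are assumptions; the zero-filled arrays are stand-in data.

```python
import numpy as np
import tensorflow as tf


def make_input_pipeline(images, labels, per_replica_batch, num_replicas):
    """Build a tf.data pipeline that keeps the devices supplied with data."""
    # Scale the global batch with the replica count so each device keeps
    # receiving per_replica_batch examples per step.
    global_batch = per_replica_batch * num_replicas
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    return (
        ds.cache()                   # keep the preprocessed data in memory
        .shuffle(len(images))
        .batch(global_batch)
        .prefetch(tf.data.AUTOTUNE)  # prepare next batches during training
    )


# Stand-in data (assumption) with MNIST's image shape.
images = np.zeros((128, 28, 28, 1), dtype="float32")
labels = np.zeros((128,), dtype="int64")

pipeline = make_input_pipeline(images, labels, per_replica_batch=16, num_replicas=2)
print(pipeline.element_spec)
```

In practice, num_replicas would usually come from strategy.num_replicas_in_sync.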

Monitoring, Debugging, and Fault Tolerance

For monitoring, debugging, and fault tolerance, consider the following:

Profiling Techniques: Use profiling tools such as the TensorFlow Profiler and TensorBoard to log training progress, find bottlenecks, and monitor resource usage.
Logging and Checkpoints: Use logging and checkpoints to track intermediate results and diagnose training problems. In distributed environments, use a centralized logging framework so that logs from all machines end up in one place.
Fault Tolerance Mechanisms: Adopt fault tolerance mechanisms such as checkpointing and job restarts so that training can resume after a failure instead of starting over. Monitor job status and health regularly in order to detect and resolve failures in a timely manner.
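In Keras, logging and restart support can be attached to training through callbacks. The sketch below is one possible setup, not the only one; the helper name make_callbacks and the directory layout are assumptions, while BackupAndRestore and TensorBoard are existing Keras callbacks.

```python
import os
import tempfile

import tensorflow as tf


def make_callbacks(work_dir):
    """Callbacks for logging and for resuming after an interruption."""
    return [
        # Saves training state regularly; if the job restarts and fit() is
        # called again with the same backup_dir, training resumes from there.
        tf.keras.callbacks.BackupAndRestore(
            backup_dir=os.path.join(work_dir, "backup")
        ),
        # Writes metrics that can be viewed in TensorBoard.
        tf.keras.callbacks.TensorBoard(
            log_dir=os.path.join(work_dir, "logs")
        ),
    ]


# A temporary directory is used here only so the snippet runs anywhere.
callbacks = make_callbacks(tempfile.mkdtemp())
print([type(cb).__name__ for cb in callbacks])
```

The list would then be passed to training as model.fit(..., callbacks=callbacks). For real jobs, work_dir should be a persistent location that survives a restart.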

Conclusion

In this article, we have seen how distributed training works in TensorFlow. Follow the example given above and try to replicate it on a dataset of your own choice. Distributed training can drastically speed up model training and makes it possible to train models that would not be feasible on a single computer. Make sure you follow the best practices and tips above to optimize performance and get the best results.


URL: https://www.geeksforgeeks.org/artificial-intelligence/distributed-training-with-tensorflow/
