When a single GPU’s compute power is the bottleneck but the model fits comfortably in its memory, the simplest way to scale is data parallelism: replicate the full model on every GPU, give each replica a different slice of the mini-batch, and synchronize gradients after the backward pass.

Data parallelism is the most widely used multi-GPU strategy because it requires no model changes and scales close to linearly up to a point. This article explains the training step, the all-reduce primitive that makes it work, and the limits of the technique.

1. Introduction

A single high-end GPU can train models with tens of millions of parameters in reasonable time, but real-world datasets are growing faster than per-GPU compute. When training a fixed model on a large dataset takes hours or days, the natural question is whether multiple GPUs can finish the job in a fraction of the time.

Data parallelism answers yes — with a simple recipe. Every GPU runs the same model on different data, and a synchronization step keeps the replicas in lockstep. The result behaves, mathematically, like a single training run on a much larger batch size.

2. The Training Step Across N GPUs

With N GPUs, every training step performs the following:

Split a mini-batch of size B into N shards of size B/N, one per GPU.
Each GPU runs forward and backward independently on its shard, producing a local gradient.
The local gradients are summed across all GPUs (and, in most frameworks, divided by N to give the average over the full mini-batch); this is the all-reduce step.
Every GPU now holds the same summed gradient and applies the optimizer update locally.

Because every replica starts the step with identical parameters and applies the same summed gradient with the same optimizer, parameters stay in sync automatically. Once the initial weights have been distributed to all replicas, no further parameter broadcast is required.
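The four steps above can be sketched in a few lines of plain Python. This is a toy simulation, not a real multi-GPU implementation: the "GPUs" are list entries, the model is a single weight w with prediction w * x, and the loss is a summed squared error, chosen so that the sum of shard gradients equals the full-batch gradient exactly. The names grad and data_parallel_step are hypothetical.

```python
def grad(w, xs, ys):
    """Gradient of sum((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def data_parallel_step(weights, xs, ys, lr):
    """One training step; weights[i] is the parameter held by simulated GPU i."""
    n = len(weights)
    shard = len(xs) // n  # assumption: batch size is divisible by n
    # Steps 1-2: each replica computes a local gradient on its own shard.
    local = [
        grad(weights[i], xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n)
    ]
    # Step 3: all-reduce, simulated here by a plain sum that every replica sees.
    total = sum(local)
    # Step 4: every replica applies the identical update locally.
    return [w - lr * total for w in weights]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

replicas = data_parallel_step([0.5] * 4, xs, ys, lr=0.001)
single_gpu = 0.5 - 0.001 * grad(0.5, xs, ys)  # same batch on one device
```

After the step, all four entries of replicas are equal to each other and match single_gpu, which is the sense in which the replicas stay in lockstep without any parameter exchange.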

3. All-Reduce: The Critical Primitive

All-reduce is the collective communication operation that sums each gradient tensor across all GPUs and distributes the result back to every GPU. NVIDIA’s NCCL library implements it efficiently over NVLink, PCIe, or InfiniBand. A common implementation is the ring algorithm:

The gradient buffer is split into N chunks.
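Each GPU repeatedly sends one chunk to its neighbor while receiving another. After N − 1 steps each GPU holds the complete sum of one chunk, and after N − 1 further steps, 2(N − 1) in total, every GPU holds the full sum.
The data each GPU sends, about 2(N − 1)/N times the buffer size, stays roughly constant as N grows; the total volume exchanged across the ring scales linearly with the number of GPUs.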
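The ring algorithm is easier to follow when simulated. The sketch below models each GPU as a Python list and performs the two phases, reduce-scatter and all-gather, of N − 1 steps each. It illustrates the data movement only and is not how NCCL is implemented; the function name ring_all_reduce and the chunk-indexing scheme are choices made for this example.

```python
def ring_all_reduce(buffers):
    """Sum equal-length lists across simulated GPUs; returns one result per GPU."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer length is divisible by N"
    c = size // n
    # chunks[g][k] is GPU g's copy of chunk k.
    chunks = [[list(buf[k * c:(k + 1) * c]) for k in range(n)] for buf in buffers]

    # Phase 1, reduce-scatter (N - 1 steps): in step s, GPU g sends chunk
    # (g - s) % n to GPU (g + 1) % n, which adds it to its own copy.
    for s in range(n - 1):
        sent = [(g, (g - s) % n, chunks[g][(g - s) % n][:]) for g in range(n)]
        for g, k, data in sent:
            dst = (g + 1) % n
            chunks[dst][k] = [a + b for a, b in zip(chunks[dst][k], data)]
    # Now GPU g holds the complete sum of chunk (g + 1) % n.

    # Phase 2, all-gather (N - 1 steps): in step s, GPU g forwards chunk
    # (g + 1 - s) % n and the receiver overwrites its copy.
    for s in range(n - 1):
        sent = [(g, (g + 1 - s) % n, chunks[g][(g + 1 - s) % n][:]) for g in range(n)]
        for g, k, data in sent:
            chunks[(g + 1) % n][k] = data

    return [[v for chunk in gpu for v in chunk] for gpu in chunks]


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
# every entry of result is [111, 222, 333]
```

In every step each GPU sends exactly one chunk of size 1/N of the buffer, which is why the per-GPU traffic does not grow with the number of GPUs.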

For a modern GPU cluster over NVLink, all-reduce of a few hundred megabytes of gradients takes a few milliseconds — small compared to a step’s compute time, but not free. As the number of GPUs grows, all-reduce eventually becomes the limiting factor.
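A back-of-the-envelope estimate shows where the "few milliseconds" comes from. The sketch below assumes a simple cost model for the ring algorithm: 2(N − 1) steps, each transferring one chunk of size S/N plus an optional fixed per-step latency. The link bandwidth in the example (100 GB/s) is a hypothetical round number chosen for illustration, not a measured figure for any particular interconnect, and real systems overlap communication with computation in ways this model ignores.

```python
def ring_all_reduce_seconds(gradient_bytes, n_gpus, link_bytes_per_s, step_latency_s=0.0):
    """Rough ring all-reduce time: 2(N-1) steps, each moving a chunk of S/N bytes."""
    if n_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    steps = 2 * (n_gpus - 1)
    chunk_bytes = gradient_bytes / n_gpus
    return steps * (chunk_bytes / link_bytes_per_s + step_latency_s)


# 200 MB of gradients, 8 GPUs, hypothetical 100 GB/s per link, no latency term.
estimate = ring_all_reduce_seconds(200e6, 8, 100e9)  # about 3.5 ms
```

Under this model the bandwidth term approaches 2S divided by the link bandwidth as N grows, while the latency term grows linearly with N, which is one reason all-reduce eventually dominates at large GPU counts.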

4. Effective Batch Size

Data parallelism multiplies the effective batch size: eight GPUs each processing 32 samples behave like a single device processing 256. This has two consequences:

Throughput scales near-linearly with GPU count, up to the point where all-reduce time starts to dominate.
Convergence dynamics change. Very large effective batches need a retuned learning rate (for example via the linear scaling rule) or a layer-wise adaptive optimizer such as LARS or LAMB to match small-batch accuracy.

This second point is often underappreciated. Simply adding more GPUs without adjusting the learning rate can yield faster wall-clock training but worse final accuracy. The relationship between batch size and learning rate is one of the most studied topics in large-scale deep learning.
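As an illustration, the linear scaling rule mentioned above multiplies the learning rate by the same factor as the batch size. The sketch below applies it to the eight-GPU example from this section; the function name and the base learning rate of 0.0125 are assumptions made for the example, and the rule is a starting point for tuning rather than a guarantee of matching accuracy.

```python
def scaled_learning_rate(base_lr, base_batch, per_gpu_batch, n_gpus):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    effective_batch = per_gpu_batch * n_gpus
    return base_lr * effective_batch / base_batch


# A learning rate tuned on one GPU with batch size 32 (hypothetical value),
# rescaled for 8 GPUs that each process 32 samples (effective batch 256).
lr_8_gpus = scaled_learning_rate(base_lr=0.0125, base_batch=32, per_gpu_batch=32, n_gpus=8)
# lr_8_gpus is 8 times the base rate, about 0.1
```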

5. When Data Parallelism Fits

Data parallelism is the right choice when three conditions are met:

The model fits on a single GPU, including parameters, activations, and optimizer state.
The dataset is large enough that a multiplied effective batch size still makes sense.
The workload is compute-bound rather than communication-bound on the available interconnect.

For models in the tens-of-millions-of-parameters range trained on large datasets, this combination is the rule rather than the exception. It is the default scaling strategy for everything from image classifiers to mid-sized language models.
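The first condition can be checked roughly before any training run. The sketch below counts the memory needed for parameters, gradients, and optimizer state, and adds a caller-supplied figure for activations. It assumes 4-byte values and two optimizer buffers per parameter (as an Adam-style optimizer would keep); the activation size and GPU capacity in the example are hypothetical, and real frameworks add overheads that this estimate ignores.

```python
def training_state_bytes(n_params, bytes_per_value=4, optimizer_slots=2):
    """Bytes for parameters, gradients, and optimizer state (activations excluded)."""
    copies = 1 + 1 + optimizer_slots  # parameters + gradients + optimizer buffers
    return n_params * bytes_per_value * copies


def fits_on_gpu(n_params, activation_bytes, gpu_bytes, **kwargs):
    """True if the estimated training footprint fits in gpu_bytes."""
    return training_state_bytes(n_params, **kwargs) + activation_bytes <= gpu_bytes


# 50 million parameters, an assumed 2 GB of activations, a 16 GB GPU.
state = training_state_bytes(50_000_000)                # 800,000,000 bytes
ok = fits_on_gpu(50_000_000, 2 * 10**9, 16 * 10**9)     # True under these assumptions
```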

6. When It Stops Scaling

Three limits stop data parallelism from scaling indefinitely:

Communication saturation. All-reduce time approaches step compute time — typically beyond eight to sixteen GPUs on commodity interconnects, later with NVLink-connected systems.
Batch-size ceiling. Effective batch size grows past what the optimizer can handle without accuracy loss.
Model size. The model itself becomes too big to fit on one GPU, at which point data parallelism alone cannot help — model parallelism is required.

The first two limits are tunable: faster interconnects and better large-batch optimizers push them further. The third is hard — once a model exceeds single-GPU memory, the technique simply no longer applies.
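The communication limit can be illustrated with a simple speedup model. The sketch below assumes a fixed global batch split across N GPUs, so compute time per step shrinks as 1/N, while all-reduce time stays roughly constant (as the ring algorithm approximates) and is not overlapped with computation. Both timing values in the example are hypothetical.

```python
def speedup(n_gpus, compute_s, comm_s):
    """Speedup over one GPU for a fixed global batch, without compute/communication overlap."""
    if n_gpus == 1:
        return 1.0  # no all-reduce on a single GPU
    return compute_s / (compute_s / n_gpus + comm_s)


# Hypothetical step: 100 ms of compute on one GPU, 5 ms of all-reduce.
for n in (2, 8, 64):
    print(n, round(speedup(n, 0.100, 0.005), 2))
```

Under these assumptions the speedup can never exceed compute_s / comm_s (20 in this example), no matter how many GPUs are added, which is the saturation described in the first limit.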

7. Conclusions

Data parallelism is the cheapest and simplest form of multi-GPU scaling:

Every GPU holds a full copy of the model and processes a different shard of the batch.
An all-reduce sum after the backward pass keeps replicas in sync — no other coordination is needed.
It scales well until communication overhead or batch-size limits dominate.

For models that fit on a single GPU, data parallelism should be the first scaling strategy tried. Only when its limits are reached is the additional complexity of model parallelism justified.

URL: https://www.opennn.net/tutorials/multi-gpu-data-parallelization/
