Training a large-scale deep neural network on a large-scale dataset is
challenging and time-consuming. Recent breakthroughs in large-batch
optimization offer a promising way to tackle this challenge. However, although the
current advanced algorithms such as LARS and LAMB succeed on classification
models, the complicated pipelines of dense visual prediction tasks such as
object detection and segmentation still suffer a heavy performance drop in the
large-batch training regime. To address this challenge, we propose a simple yet
effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which
can train dense visual predictors with very large batch sizes and offers several
advantages over prior art. Firstly, AGVM can align the gradient
variances between the different modules of a dense visual predictor, such as
the backbone, the feature pyramid network (FPN), and the detection and
segmentation heads. We show that large-batch training can fail when the
gradient variances of these modules are misaligned, a phenomenon largely
overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many
different architectures (e.g., CNNs and Transformers) and different tasks
(e.g., object detection, instance segmentation, semantic segmentation, and
panoptic segmentation). It is also compatible with different optimizers (e.g.,
SGD and AdamW). Thirdly, we provide a theoretical analysis of AGVM. Extensive
experiments on the COCO and ADE20K datasets demonstrate the superiority of
AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without
losing performance. AGVM also enables training an object detector with one
billion parameters in just 3.5 hours, reducing the training time by 20.9x while
achieving 62.2 mAP on COCO. The deliverables are released at
https://github.com/Sense-X/AGVM.
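
To make the idea of aligning gradient variances across modules concrete, the following is a minimal illustrative sketch and not the released AGVM implementation. It assumes that gradient variance is estimated from per-micro-batch gradient vectors and that each module is rescaled by the square root of the backbone-to-module variance ratio; the function names, the choice of the backbone as reference, and the scaling rule itself are all assumptions made for illustration.

```python
from math import sqrt


def gradient_variance(micro_batch_grads):
    """Estimate the gradient variance of one module.

    `micro_batch_grads` is a list of equally sized gradient vectors, one per
    micro-batch (a hypothetical input format chosen for this sketch).
    """
    n = len(micro_batch_grads)
    dim = len(micro_batch_grads[0])
    # Mean gradient over the micro-batches, per coordinate.
    mean = [sum(g[j] for g in micro_batch_grads) / n for j in range(dim)]
    # Average squared deviation from the mean, over micro-batches and coordinates.
    total = sum(
        (g[j] - mean[j]) ** 2 for g in micro_batch_grads for j in range(dim)
    )
    return total / (n * dim)


def alignment_factors(grads_by_module, reference="backbone", eps=1e-12):
    """Return a per-module scale factor relative to the reference module.

    Assumption: a module whose gradient variance exceeds the reference gets a
    factor below 1, and the reference itself gets exactly 1.
    """
    variances = {
        name: gradient_variance(grads) for name, grads in grads_by_module.items()
    }
    ref = variances[reference]
    return {
        name: sqrt((ref + eps) / (var + eps)) for name, var in variances.items()
    }


if __name__ == "__main__":
    # Toy gradients: two micro-batches, two parameters per module.
    toy = {
        "backbone": [[1.0, 1.0], [3.0, 3.0]],
        "head": [[0.0, 0.0], [8.0, 8.0]],
    }
    for name, factor in alignment_factors(toy).items():
        print(name, round(factor, 4))
```

In this toy setup the head's gradients are noisier than the backbone's, so the head receives a factor below one. How such a factor would be applied (for example, to a module's learning rate or update) is left open here, since the abstract does not specify it.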