A26: Co-Designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
Session: Poster Reception
Event Type: ACM Student Research Competition Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: Deep Learning frameworks like Caffe, TensorFlow, and CNTK have brought forward new requirements and challenges for communication runtimes like MVAPICH2-GDR. These include support for low-latency and high-bandwidth communication of very large GPU-resident buffers. This support is essential to enable scalable distributed training of Deep Neural Networks on GPU clusters. However, current MPI runtimes have limited support for large-message GPU-based collectives. To address this, we propose the S-Caffe framework: a co-design of distributed training in Caffe and large-message collectives in MVAPICH2-GDR. We highlight two designs for MPI_Bcast, one that exploits NVIDIA NCCL and the other that exploits ring-based algorithms. Further, we present designs for MPI_Reduce that provide up to 2.5X improvement. We also present layer-wise gradient aggregation designs in S-Caffe that exploit overlap of computation and communication as well as the proposed reduce design. S-Caffe scales out to 160 GPUs for GoogLeNet training and delivers performance comparable to CNTK for AlexNet training.
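
To illustrate the overlap of computation and communication described above, the following is a minimal sketch of layer-wise gradient aggregation over GPU-resident buffers, assuming a CUDA-aware MPI build such as MVAPICH2-GDR that accepts device pointers in collectives. The layer sizes and the backward_layer()/update_layer() helpers are hypothetical placeholders, not S-Caffe's actual API; the sketch only shows how non-blocking MPI_Ireduce calls on per-layer gradients can overlap with the backward computation of the remaining layers, and how a GPU-resident parameter buffer can be distributed with MPI_Bcast.

#include <mpi.h>
#include <cuda_runtime.h>

#define NUM_LAYERS 4

/* Hypothetical per-layer gradient sizes (element counts). */
static const size_t layer_len[NUM_LAYERS] = { 1 << 20, 1 << 22, 1 << 21, 1 << 18 };

/* Placeholder: compute the gradient of one layer on the GPU. */
static void backward_layer(int layer, float *d_grad) { (void)layer; (void)d_grad; }

/* Placeholder: apply the aggregated gradient of one layer. */
static void update_layer(int layer, const float *d_agg) { (void)layer; (void)d_agg; }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_params, *d_grad[NUM_LAYERS], *d_agg[NUM_LAYERS];
    MPI_Request req[NUM_LAYERS];

    /* GPU-resident parameter buffer, broadcast from rank 0 at startup.
     * With a CUDA-aware MPI such as MVAPICH2-GDR, the device pointer is
     * passed to MPI_Bcast directly, without staging through host memory. */
    size_t param_len = 1 << 22;                      /* hypothetical size */
    cudaMalloc((void **)&d_params, param_len * sizeof(float));
    MPI_Bcast(d_params, (int)param_len, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int l = 0; l < NUM_LAYERS; ++l) {
        cudaMalloc((void **)&d_grad[l], layer_len[l] * sizeof(float));
        cudaMalloc((void **)&d_agg[l],  layer_len[l] * sizeof(float));
    }

    /* Backward pass: as soon as a layer's gradient is ready, start a
     * non-blocking reduction so that layer's communication overlaps with
     * the backward computation of the layers that follow it. */
    for (int l = NUM_LAYERS - 1; l >= 0; --l) {
        backward_layer(l, d_grad[l]);
        MPI_Ireduce(d_grad[l], d_agg[l], (int)layer_len[l], MPI_FLOAT,
                    MPI_SUM, 0, MPI_COMM_WORLD, &req[l]);
    }

    /* Complete each reduction and apply the aggregated gradient. */
    for (int l = NUM_LAYERS - 1; l >= 0; --l) {
        MPI_Wait(&req[l], MPI_STATUS_IGNORE);
        if (rank == 0)
            update_layer(l, d_agg[l]);
    }

    for (int l = 0; l < NUM_LAYERS; ++l) {
        cudaFree(d_grad[l]);
        cudaFree(d_agg[l]);
    }
    cudaFree(d_params);
    MPI_Finalize();
    return 0;
}

In this sketch the reductions are issued in the order the gradients become available during the backward pass, which is the scheduling idea behind the layer-wise aggregation designs; the actual S-Caffe designs in MVAPICH2-GDR additionally use the large-message MPI_Reduce and MPI_Bcast optimizations described in the abstract.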




