Scalable Reduction Collectives with Data Partitioning-Based Multi-Leader Design

Event Type: Paper
Session: Optimizing MPI
Programming Systems
Time: Thursday, November 16th, 4:30pm - 5pm
Location: 405-406-407
Description: Existing designs for MPI Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors such as Intel Xeon/Xeon Phi, or of the increased communication throughput and advanced high-end features offered by modern interconnects such as InfiniBand and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI Allreduce that exploits the parallelism offered by multi-/many-core architectures in conjunction with the high throughput and high-end features offered by InfiniBand and Omni-Path to significantly enhance the performance of MPI Allreduce on modern HPC systems. We also model the DPML-based designs to analyze their communication costs theoretically. Microbenchmark-level evaluations show that the proposed DPML-based designs deliver up to 3.5x performance improvement for MPI Allreduce on multiple HPC systems at scale. At the application level, improvements of up to 35% and 60% are seen for HPCG and miniAMR, respectively.
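The listing does not include the paper's pseudocode, but the multi-leader idea described in the abstract can be sketched in plain MPI. The following C sketch is an illustrative assumption, not the authors' implementation: the buffer is split into NUM_LEADERS chunks, chunk i is reduced intra-node to leader i, the i-th leaders across all nodes allreduce their chunk concurrently, and each leader then broadcasts its finished chunk within the node. The names dpml_allreduce and NUM_LEADERS, and all structural choices, are hypothetical.

```c
/* dpml_allreduce.c -- a minimal, hypothetical sketch of a data
 * partitioning-based multi-leader allreduce in the spirit of DPML.
 * Assumes every node hosts at least NUM_LEADERS ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NUM_LEADERS 4  /* assumed number of leaders per node */

static void dpml_allreduce(const double *sendbuf, double *recvbuf,
                           int count, MPI_Comm comm)
{
    int world_rank, local_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &world_rank);

    /* Group the ranks that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Leader i on every node joins the same inter-node communicator;
       non-leaders get MPI_COMM_NULL. */
    int color = (local_rank < NUM_LEADERS) ? local_rank : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, world_rank, &leader_comm);

    int chunk = count / NUM_LEADERS;  /* last chunk absorbs the remainder */
    for (int i = 0; i < NUM_LEADERS; i++) {
        int off = i * chunk;
        int n = (i == NUM_LEADERS - 1) ? count - off : chunk;
        /* Step 1: reduce chunk i onto leader i within the node. */
        MPI_Reduce(sendbuf + off, recvbuf + off, n, MPI_DOUBLE,
                   MPI_SUM, i, node_comm);
    }

    /* Step 2: each leader allreduces its own chunk across nodes; the
       NUM_LEADERS inter-node exchanges can proceed concurrently. */
    if (leader_comm != MPI_COMM_NULL) {
        int off = local_rank * chunk;
        int n = (local_rank == NUM_LEADERS - 1) ? count - off : chunk;
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + off, n, MPI_DOUBLE,
                      MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: leaders broadcast their finished chunks within the node. */
    for (int i = 0; i < NUM_LEADERS; i++) {
        int off = i * chunk;
        int n = (i == NUM_LEADERS - 1) ? count - off : chunk;
        MPI_Bcast(recvbuf + off, n, MPI_DOUBLE, i, node_comm);
    }
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    enum { N = 1 << 20 };
    double *in  = malloc(N * sizeof *in);
    double *out = malloc(N * sizeof *out);
    for (int i = 0; i < N; i++) in[i] = 1.0;
    dpml_allreduce(in, out, N, MPI_COMM_WORLD);
    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```

The intuition behind using several leaders per node, as the abstract describes, is that a single-leader scheme funnels all inter-node traffic through one process; partitioning the data across leaders lets multiple network endpoints drive the interconnect at once.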