A27: High-Performance and Scalable Broadcast Schemes for
Deep Learning on GPU Clusters
Session: Poster Reception
Event Type: ACM Student Research Competition Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: Broadcast is a widely used operation in many streaming and deep learning applications for disseminating large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes are not well optimized for upcoming large-scale Graphics Processing Unit (GPU)-based systems, and exploiting cutting-edge features of modern HPC technologies such as InfiniBand (IB) and NVIDIA GPUs to enable scalable heterogeneous broadcast operations remains an open challenge.
Toward delivering the best performance for streaming and deep learning workloads, we propose high-performance and scalable broadcast schemes that exploit IB hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology. In benchmark-level evaluations, our experimental results indicate improved scalability and up to 68% lower latency than state-of-the-art solutions. Furthermore, the proposed design yields up to 24% performance improvement for the popular deep learning framework Microsoft Cognitive Toolkit (CNTK) without any application changes.