An Efficient Task-Based All-Reduce for Machine Learning Applications
Event Type: Workshop
Tags: Deep Learning, Machine Learning, SIGHPC Workshop
Time: Monday, November 13th, 11am - 11:30am
Location: 502-503-504
Description: All-Reduce is a collective-combine operation frequently utilised for synchronous parameter updates in parallel machine learning algorithms. The performance of this operation, and consequently of the algorithm itself, is heavily dependent on its implementation, its configuration, and the hardware on which it runs. Given the pivotal role of all-reduce, a deficiency in any of these areas will significantly impact the resulting scientific output.
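To make the role of all-reduce concrete, the sketch below illustrates a synchronous parameter update in plain Python: every worker contributes a local gradient, and every worker receives the same reduced result before applying an identical update. The names allreduce_sum and synchronous_update are illustrative assumptions, not the paper's API, and the all-reduce shown uses the naive reduce-then-broadcast reference semantics.

def allreduce_sum(local_grads):
    # Reference semantics only: reduce every worker's vector to an
    # elementwise sum, then hand the same result back to each worker.
    total = [sum(col) for col in zip(*local_grads)]
    return [list(total) for _ in local_grads]

def synchronous_update(weights, local_grads, lr=0.01):
    # Hypothetical update step: average the summed gradients so every
    # model replica applies an identical update and stays in lockstep.
    P = len(local_grads)
    reduced = allreduce_sum(local_grads)
    return [[w - lr * g / P for w, g in zip(weights, r)] for r in reduced]

# Two workers, each holding a replica of a 3-parameter model.
replicas = synchronous_update([1.0, 1.0, 1.0], [[0.2, 0.4, 0.6], [0.6, 0.4, 0.2]])
assert replicas[0] == replicas[1]  # replicas remain identical after the step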
In this research, we explore the performance of alternative all-reduce algorithms in data-flow graphs and compare them with the commonly used reduce-broadcast approach. We present an architecture and interface for all-reduce in task-based frameworks, along with a parallelization scheme for object serialization and computation. We present a concrete, novel application of a butterfly all-reduce algorithm on the Apache Spark framework on a high-performance compute cluster, and demonstrate its effectiveness: the butterfly algorithm achieves a speed-up that scales logarithmically with vector length relative to the original reduce-broadcast method, with a 9x speed-up observed for vector lengths on the order of 10^8. This improvement comprises both algorithmic changes (65%) and parallel-processing optimizations (35%).
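The sketch below illustrates the butterfly (recursive-doubling) exchange pattern in plain Python, assuming a power-of-two worker count; it simulates the pairwise stages rather than reproducing the paper's Spark implementation, and butterfly_allreduce is an illustrative name.

import math

def butterfly_allreduce(vectors):
    # Simulated butterfly all-reduce: at stage s, worker i exchanges its
    # partial result with partner i XOR 2**s and combines the pair, so
    # after log2(P) stages every worker holds the elementwise sum of all
    # P input vectors. P is assumed to be a power of two.
    P = len(vectors)
    assert P > 0 and P & (P - 1) == 0, "worker count must be a power of two"
    state = [list(v) for v in vectors]
    for s in range(int(math.log2(P))):
        state = [
            [a + b for a, b in zip(state[i], state[i ^ (1 << s)])]
            for i in range(P)
        ]
    return state

# Four workers; each ends the exchange holding the full sum [16, 20].
result = butterfly_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]])
assert all(r == [16, 20] for r in result)

The design point this pattern captures is that the reduction work is spread across all workers in log2(P) pairwise stages, rather than funnelled through a single reducer as in reduce-broadcast.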
The effectiveness of the new butterfly all-reduce is demonstrated using real-world neural network applications with the Spark framework. For the model-update operation, we observe significant speed-ups using the new butterfly algorithm compared with the original reduce-broadcast, for both smaller (CIFAR and MNIST) and larger (ImageNet) datasets.




