Training Distributed Deep Recurrent Neural Networks with
Mixed Precision on GPU Clusters
Author/Presenters
Event Type
Workshop
Deep Learning
Machine Learning
SIGHPC Workshop
Time: Monday, November 13th, 2:40pm - 3pm
Location: 502-503-504
Description: In this paper, we evaluate training of deep recurrent
neural networks with half-precision floats. We implement
a distributed, data-parallel, synchronous training
algorithm by integrating TensorFlow and CUDA-aware MPI
to enable execution across multiple GPU nodes and make
use of high-speed interconnects. We introduce a learning
rate schedule that facilitates neural network convergence at
up to O(100) workers.
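As a rough illustration of the synchronous, data-parallel idea described above, the following sketch simulates several workers in a single process with NumPy. It is not the authors' TensorFlow and MPI implementation: the helper names `allreduce_mean_fp16` and `sync_sgd_step` are hypothetical, the in-process loop merely stands in for an MPI allreduce, and accumulating the sum in single precision is an assumption made here for illustration.

```python
import numpy as np


def allreduce_mean_fp16(local_grads):
    """Average per-worker gradients that are exchanged as float16.

    Hypothetical stand-in for an MPI allreduce: all "workers" live in
    this one process, and the list holds one gradient array per worker.
    """
    # Cast to half precision to model the reduced communication volume.
    sent = [g.astype(np.float16) for g in local_grads]
    # Assumption: accumulate in float32 to limit rounding error in the sum.
    total = np.zeros_like(local_grads[0], dtype=np.float32)
    for g in sent:
        total += g.astype(np.float32)
    return total / len(sent)


def sync_sgd_step(weights, local_grads, lr):
    """One synchronous step: every worker applies the same averaged gradient."""
    return weights - lr * allreduce_mean_fp16(local_grads)


# Toy usage: 8 simulated workers, each holding a gradient for 4 parameters.
rng = np.random.default_rng(0)
weights = np.ones(4, dtype=np.float32)
grads = [rng.normal(size=4).astype(np.float32) for _ in range(8)]
new_weights = sync_sgd_step(weights, grads, lr=0.1)
```

Because every worker receives the same averaged gradient, all model replicas stay identical after each step, which is what makes the scheme synchronous.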
Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime scaling and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensor measurements leading up to deleterious events called plasma disruptions. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to that of single precision.
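To give a sense of scale for the memory and bandwidth claim, the short calculation below estimates the size of one copy of the parameters (or of one gradient buffer) at 4 bytes per value for single precision and 2 bytes per value for half precision. The figure of exactly 70 million values is taken as a lower bound from the description, and the helper `param_bytes` is a hypothetical name. The sketch ignores optimizer state, activations, and any single-precision master copy a mixed precision setup may keep.

```python
def param_bytes(num_params, bytes_per_value):
    """Bytes needed to store or transmit one copy of num_params values."""
    return num_params * bytes_per_value


NUM_PARAMS = 70_000_000  # lower bound quoted in the description

fp32_mb = param_bytes(NUM_PARAMS, 4) / 1e6  # single precision, in MB
fp16_mb = param_bytes(NUM_PARAMS, 2) / 1e6  # half precision, in MB

print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
```

Under these assumptions, each gradient exchange moves half as many bytes in half precision as in single precision.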




