An In-Depth Performance Characterization of CPU- and
GPU-Based DNN Training on Modern Architectures
Event Type: Workshop
Tags: Deep Learning, Machine Learning, SIGHPC Workshop
Time: Monday, November 13th, 4:18pm - 4:42pm
Location: 502-503-504
Description: Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit have exploited GPUs to accelerate the training process. This has been achieved primarily through aggressive improvements in parallel hardware as well as sophisticated software libraries like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware and software have the potential to significantly improve the performance of CPU-based DL training. In this paper, we provide a complete performance landscape of CPU- and GPU-based DNN training. We characterize the performance of DNN training for AlexNet and ResNet-50 across a wide range of CPU and GPU architectures, including the latest Intel Xeon Phi (Knights Landing) processors and NVIDIA Pascal GPUs. We also present multi-node DNN training performance results for AlexNet and ResNet-50 using the Intel Machine Learning Scaling Library (MLSL) and Intel-Caffe. In addition, we provide a CPU vs. GPU comparison for multi-node training using OSU-Caffe and Intel-Caffe. To the best of our knowledge, this is the first study that examines the performance of DNN training in a holistic manner while also providing an in-depth look at layer-wise performance for different DNNs.
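
To make the layer-wise characterization concrete, the following is a minimal sketch (not the paper's actual harness) of per-layer forward/backward timing with pycaffe; the prototxt path is a placeholder, and Caffe's built-in "caffe time" tool reports comparable per-layer numbers.

import time
import caffe

caffe.set_mode_cpu()  # or: caffe.set_mode_gpu(); caffe.set_device(0)

# Placeholder path: substitute an AlexNet or ResNet-50 training prototxt.
net = caffe.Net('alexnet_train_val.prototxt', caffe.TRAIN)

for name in list(net._layer_names):
    t0 = time.time()
    net.forward(start=name, end=name)    # forward pass of this layer only
    fwd_ms = (time.time() - t0) * 1e3
    t0 = time.time()
    net.backward(start=name, end=name)   # backward pass of this layer only
    bwd_ms = (time.time() - t0) * 1e3
    print('%-25s forward %8.3f ms   backward %8.3f ms' % (name, fwd_ms, bwd_ms))

Aggregating such per-layer times by layer type is one way to arrive at figures like the convolution share quoted below.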
We provide multiple key insights: 1) convolutions account for the majority of the time consumed in DNN training (up to 83%), 2) GPU-based training continues to deliver excellent performance (up to 18% better than KNL) across generations of GPU hardware and software, and 3) recent CPU-based optimizations like MKL-DNN and OpenMP-based thread parallelism lead to excellent speedups over under-optimized designs (up to 3.2x improvement for AlexNet training).
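
As a rough illustration of the CPU-side knobs mentioned above, the sketch below sets the standard Intel OpenMP controls before invoking Caffe's per-layer benchmark; the thread count, paths, and use of the "caffe time" tool are illustrative assumptions rather than the authors' exact configuration, and MKL-DNN engine selection itself is configured in the Intel-Caffe build or prototxt.

import os
import subprocess

env = dict(os.environ)
env['OMP_NUM_THREADS'] = '68'                     # e.g. one thread per core on a 68-core KNL
env['KMP_AFFINITY'] = 'granularity=fine,compact'  # pin OpenMP threads to cores

# Caffe's built-in benchmark reports per-layer forward/backward times.
subprocess.check_call(
    ['caffe', 'time', '--model=alexnet_train_val.prototxt', '--iterations=50'],
    env=env)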




