P82: Performance Evaluation of the NVIDIA Tesla P100: Our
Directive-Based Partitioning and Pipelining vs. NVIDIA’s
Unified Memory
Session: Poster Reception
Event Type: ACM Student Research Competition Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: We need simpler mechanisms to leverage the performance
of accelerators, such as GPUs, in supercomputers.
Programming models like OpenMP offer simple-to-use but
powerful directive-based offload mechanisms. By default,
these models naively copy data to or from the device
without overlapping computation. Achieving performance
can require extensive hand-tuning to apply optimizations
such as pipelining. Users must manually partition data
whenever it exceeds device memory. Our directive-based
partitioning and pipelining extension for accelerators
overlaps data transfers and kernel computation without
explicit user data-splitting. We compare a prototype
implementation of our extension to NVIDIA's Unified
Memory on the Pascal P100 GPU and find that our
extension outperforms Unified Memory on average by 68%
for data sets that fit into GPU memory and 550% for
those that do not.




