Gravel: Fine-Grain GPU-Initiated Network Messages
Session: GPUs and Communication
Event Type: Paper
Accelerators
Communication Optimization
System Software
Time: Wednesday, November 15th, 11am - 11:30am
Location: 402-403-404
Description: Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages.
To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeting the same destination. Gravel leverages diverged work-group-level semantics to amortize shared-memory synchronization across the GPU’s data-parallel lanes.
Using Gravel, we can distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show that Gravel is more programmable and usually performs better than prior GPU networking models.
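The aggregation step described above can be illustrated with a small host-side sketch. This is not Gravel's implementation: the `Aggregator` class, the `(destination, payload)` message layout, the `batch_size` threshold, and the `send` callback are all hypothetical names chosen for illustration, and a plain Python list stands in for the GPU-to-CPU concurrent queue. It only shows the general idea of combining small messages bound for the same destination into fewer, larger sends.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical message layout: (destination node id, payload bytes).
Message = Tuple[int, bytes]


class Aggregator:
    """Toy per-destination message combiner (illustrative only)."""

    def __init__(self, send: Callable[[int, List[bytes]], None],
                 batch_size: int = 4) -> None:
        self.send = send              # assumed network send hook
        self.batch_size = batch_size  # assumed flush threshold
        self.buffers: Dict[int, List[bytes]] = defaultdict(list)

    def offload(self, messages: List[Message]) -> None:
        """Accept small messages and buffer them by destination."""
        for dest, payload in messages:
            buf = self.buffers[dest]
            buf.append(payload)
            # Send one combined batch once enough messages have accumulated.
            if len(buf) >= self.batch_size:
                self.flush(dest)

    def flush(self, dest: int) -> None:
        """Send whatever is buffered for one destination, if anything."""
        batch = self.buffers.pop(dest, [])
        if batch:
            self.send(dest, batch)

    def flush_all(self) -> None:
        """Drain every partially filled buffer, e.g. at the end of a kernel."""
        for dest in list(self.buffers):
            self.flush(dest)


def demo() -> List[Tuple[int, List[bytes]]]:
    sent: List[Tuple[int, List[bytes]]] = []
    agg = Aggregator(lambda dest, batch: sent.append((dest, batch)),
                     batch_size=2)
    agg.offload([(1, b"a"), (2, b"b"), (1, b"c"), (2, b"d"), (3, b"e")])
    agg.flush_all()
    return sent
```

In this toy run, five small messages result in three network sends. A real aggregator would also need to bound how long a partially filled buffer may wait before it is flushed, which the sketch leaves out.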




