Leveraging Near Data Processing for High-Performance
Checkpoint/Restart
Authors
Event Type
Paper
Time
Thursday, November 16th, 4pm - 4:30pm
Location
402-403-404
Description
With the increasing size of HPC systems, the system
mean time to interrupt will decrease. This requires
checkpoints to be written in less time when using
checkpoint/restart (C/R) for mitigation. Multilevel
checkpointing improves C/R efficiency by saving most
checkpoints to fast compute-node local storage. But it
incurs a high cost for writing a few checkpoints to slow
global-I/O. We show that leveraging near data processing (NDP) to offload
writing of checkpoints to global-I/O improves C/R
efficiency. We explore additional opportunities using
NDP to further reduce C/R overhead and evaluate
checkpoint compression using NDP as a starting point.
We evaluate the performance of our novel application of NDP for C/R and compare it to existing C/R optimizations. Our evaluation for a projected exascale system using multilevel checkpointing shows that with NDP, the host processor is able to increase its efficiency on average from 51% to 78% (i.e., a >50% speedup in performance).
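The ">50% speedup" figure follows from the two efficiency numbers quoted in the description. A minimal sketch of that arithmetic, assuming "efficiency" means the fraction of wall-clock time the host spends on useful computation rather than on C/R overhead (the function name `speedup` is hypothetical and not taken from the paper):

```python
def speedup(eff_before: float, eff_after: float) -> float:
    """Ratio of useful work done per unit wall-clock time, after vs. before.

    Assumption: efficiency is the fraction of wall-clock time spent on
    useful computation, so throughput scales linearly with efficiency.
    """
    if not (0 < eff_before <= 1 and 0 < eff_after <= 1):
        raise ValueError("efficiencies must be in (0, 1]")
    return eff_after / eff_before


baseline = 0.51   # average host efficiency without NDP (from the description)
with_ndp = 0.78   # average host efficiency with NDP (from the description)

gain = speedup(baseline, with_ndp)
print(f"speedup: {gain:.2f}x ({(gain - 1) * 100:.0f}% more useful work per unit time)")
# prints: speedup: 1.53x (53% more useful work per unit time)
```

Under this assumption the ratio 0.78 / 0.51 is about 1.53, which is consistent with the stated ">50%" improvement.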




