Presentation

Paper: Leveraging Near Data Processing for High-Performance Checkpoint/Restart

Session: In-System Processing for Performance

Authors

Abhinav Agrawal

Gabriel H. Loh

James Tuck

Event Type: Paper

Time: Thursday, November 16th, 4pm - 4:30pm

Location: 402-403-404

Description: As HPC systems grow, the system mean time to interrupt will decrease. When checkpoint/restart (C/R) is used for mitigation, checkpoints must therefore be stored in less time. Multilevel checkpointing improves C/R efficiency by saving most checkpoints to fast compute-node local storage, but it incurs a high cost for writing a few checkpoints to slow global I/O. We show that leveraging near data processing (NDP) to offload the writing of checkpoints to global I/O improves C/R efficiency. We also explore additional opportunities to reduce C/R overhead with NDP, and evaluate checkpoint compression using NDP as a starting point.
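To make the offloading argument concrete, here is a minimal sketch of a toy efficiency model. It is not the model used in the paper: the function name, parameters, and all numbers are hypothetical, failures and restart costs are ignored, and the offloaded global write is assumed to cost the host nothing.

```python
def host_efficiency(compute_s, local_ckpt_s, global_ckpt_s, global_every, offload_global):
    """Fraction of wall-clock time the host spends on useful computation.

    Toy model (an assumption, not the paper's): one cycle consists of
    `global_every` compute intervals, each followed by a local checkpoint,
    plus one checkpoint written to global I/O.
    """
    # Useful work done by the host over one cycle.
    useful = global_every * compute_s
    # Every interval ends with a blocking write to node-local storage.
    blocked = global_every * local_ckpt_s
    # The host waits for the global write only when it is not offloaded.
    if not offload_global:
        blocked += global_ckpt_s
    return useful / (useful + blocked)


# Hypothetical timings in seconds, chosen only for illustration.
baseline = host_efficiency(600.0, 10.0, 1200.0, 4, offload_global=False)
offloaded = host_efficiency(600.0, 10.0, 1200.0, 4, offload_global=True)
print(f"baseline={baseline:.3f} offloaded={offloaded:.3f}")
```

In this sketch the only difference between the two cases is whether the host blocks on the slow global write, which is the cost the abstract attributes to multilevel checkpointing.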

We evaluate the performance of our novel application of NDP for C/R and compare it to existing C/R optimizations. Our evaluation of a projected exascale system using multilevel checkpointing shows that, with NDP, the host processor's average efficiency increases from 51% to 78% (i.e., a >50% speedup in performance).
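The ">50% speedup" figure can be checked with a short calculation. This sketch assumes that delivered performance is proportional to host efficiency, meaning the same total work with wall-clock time scaling inversely with efficiency. The helper name is hypothetical.

```python
def speedup_from_efficiency(before, after):
    """Speedup implied by an efficiency change, assuming performance
    is proportional to efficiency (an assumption of this sketch)."""
    return after / before


# Average efficiencies quoted in the description: 51% and 78%.
ratio = speedup_from_efficiency(0.51, 0.78)
gain_percent = (ratio - 1.0) * 100.0
print(f"speedup ratio = {ratio:.3f}, gain = {gain_percent:.1f}%")
```

Under that assumption the ratio is 0.78 / 0.51, which is a little above 1.5 and so consistent with the stated ">50%" improvement.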

Authors

Abhinav Agrawal

North Carolina State University

Gabriel H. Loh

Advanced Micro Devices Inc

James Tuck

North Carolina State University
