Resilient N-Body Tree Computations with Algorithm-Based
Focused Recovery: Model and Performance Analysis
Author/Presenters
Event Type: Workshop
Accelerators
Benchmarks
Compiler Analysis and Optimization
Deep Learning
Effective Application of HPC
Energy
Exascale
GPU
I/O
Parallel Application Frameworks
Parallel Programming Languages, Libraries, Models and Notations
Performance
Simulation
Storage
Time: Monday, November 13th, 2pm - 2:30pm
Location: 704-706
Description: This talk presents a model and performance
study for Algorithm-Based Focused Recovery (ABFR)
applied to N-body computations, subject to latent
errors. We make a detailed comparison with the classical
Checkpoint/Restart (CR) approach. While the model
applies to general frameworks, the performance study is
limited to perfect binary trees, due to the inherent
difficulty of the analysis. With ABFR, the crucial
parameter is the detection interval, which bounds the
error latency. We show that the detection interval has a
dramatic impact on the overhead, and that optimally
choosing its value leads to significant gains over the
CR approach.




