A Highly Scalable, Algorithm-Based Fault-Tolerant Solver
for Gyrokinetic Plasma Simulations
Author/Presenters
Event Type: Workshop
Algorithms
Exascale
Resiliency
SIGHPC Workshop
Time: Monday, November 13th, 4:10pm - 4:30pm
Location: 607
Description: With future exascale computers expected to have
millions of compute units distributed among thousands of
nodes, system faults are predicted to become more
frequent. Fault tolerance will thus play a key role in
HPC at this scale. In this presentation, we focus on
solving the 5-dimensional gyrokinetic Vlasov-Maxwell
equations using the application code GENE as it
represents a high-dimensional and resource-intensive
problem which is a natural candidate for exascale
computing. We discuss the Fault-Tolerant Combination
Technique, a resilient version of the Combination
Technique, a method to increase the discretization
resolution of existing PDE solvers. For the first time,
we present an efficient, scalable and fault-tolerant
implementation of this algorithm for plasma physics
simulations based on a manager-worker model and test it
under both realistic and pessimistic conditions with
simulated faults. We show that the Fault-Tolerant
Combination Technique – an algorithm-based forward
recovery method – can tolerate a large number of faults
with a low overhead and at an acceptable loss in
accuracy. Our parallel experiments with up to 32k cores
show good scalability at a relative parallel efficiency
of 93.61%. We conclude that algorithm-based solutions to
fault tolerance are attractive for this type of
problem.
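
To illustrate the idea behind the Combination Technique and its fault-tolerant variant mentioned in the description, the following is a minimal Python sketch. It is not taken from GENE or from the presented implementation. It only computes combination coefficients for a set of coarse anisotropic grids, identified by level vectors, using the standard inclusion-exclusion formula for downward-closed index sets. The function names (combination_coefficients, simplex_index_set, drop_failed) are hypothetical, and the recovery rule in drop_failed is a simplification assumed for illustration; the actual method chooses the replacement combination more carefully.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of level vectors.

    c_l = sum over z in {0,1}^d of (-1)^{|z|} * [l + z is in the set].
    Only nonzero coefficients are returned.
    """
    index_set = set(index_set)
    d = len(next(iter(index_set)))
    coeffs = {}
    for l in index_set:
        c = 0
        for z in product((0, 1), repeat=d):
            neighbour = tuple(li + zi for li, zi in zip(l, z))
            if neighbour in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


def simplex_index_set(d, n):
    """Classical index set: level vectors with l_i >= 1 and |l|_1 <= n + d - 1."""
    return {l for l in product(range(1, n + 1), repeat=d) if sum(l) <= n + d - 1}


def drop_failed(index_set, failed):
    """Simplified recovery (assumption): remove the failed grid and every grid
    that is componentwise >= it, so the remaining set stays downward closed."""
    return {l for l in index_set
            if not all(li >= fi for li, fi in zip(l, failed))}


if __name__ == "__main__":
    grids = simplex_index_set(d=2, n=3)
    print(sorted(combination_coefficients(grids).items()))

    # Simulate losing the component grid with level vector (2, 2).
    survivors = drop_failed(grids, failed=(2, 2))
    print(sorted(combination_coefficients(survivors).items()))
```

In this two-dimensional example with n = 3, the classical combination uses grids (1,3), (2,2), (3,1) with coefficient +1 and grids (1,2), (2,1) with coefficient -1. After grid (2,2) is lost, the recomputed combination uses (1,3) and (3,1) with +1 and (1,1) with -1. The coefficients still sum to 1, but the combination now relies on a coarser grid that the classical scheme did not need, which gives a rough picture of why forward recovery of this kind trades some accuracy for avoiding recomputation.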




