Analyzing the Criticality of Transient Faults-Induced
SDCs on GPU Applications
Author/Presenters
Event Type
Workshop
Algorithms
Exascale
Resiliency
SIGHPC Workshop
Time
Monday, November 13th, 4:30pm - 4:50pm
Location
607
Description
In this paper, we compare the soft-error sensitivity of
parallel applications on modern GPUs obtained through
architectural-level fault injections and high-energy
particle beam radiation experiments. Fault-injection and
beam experiments provide different information and use
different transient-fault sensitivity metrics, which are
hard to combine. In this presentation, we show how
correlating beam and fault-injection data can provide a
deeper understanding of how GPUs behave in the
presence of transient faults. In particular, we
demonstrate that commonly used architecture-level fault
models (and fast injection tools) can be used to
identify critical kernels and to associate some
experimentally observed output errors with their causes.
Additionally, we show how register file and
instruction-level injections can be used to evaluate ECC
efficiency in reducing the radiation-induced error
rate.
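
To illustrate why the two kinds of experiments yield metrics that are hard to combine, here is a minimal sketch of how each is commonly computed. The description does not name the specific metrics used in the paper, so everything here is an assumption: the function names are hypothetical, the beam metric is taken to be a cross-section scaled to a FIT rate (failures per 10^9 device-hours), the default flux of 13 neutrons/(cm^2 h) is the commonly cited sea-level reference value, and the fault-injection metric is taken to be the fraction of injections that end in a silent data corruption (SDC).

```python
# Hypothetical helpers; names and metric choices are illustrative assumptions,
# not taken from the paper.

def cross_section(num_errors, fluence):
    """Beam metric: observed errors divided by particle fluence (particles/cm^2).

    The result is in cm^2 and reflects the whole device under the beam.
    """
    return num_errors / fluence


def fit_rate(sigma, flux=13.0):
    """Scale a cross-section (cm^2) to failures per 10^9 hours.

    flux is in particles/(cm^2 * h); 13.0 is an assumed sea-level
    reference value for neutrons.
    """
    return sigma * flux * 1e9


def sdc_probability(outcomes):
    """Fault-injection metric: fraction of injected faults that produce an SDC.

    outcomes is a list of labels such as "masked", "sdc", or "crash",
    one per injection run.
    """
    return outcomes.count("sdc") / len(outcomes)


if __name__ == "__main__":
    sigma = cross_section(num_errors=50, fluence=1e10)
    print("cross-section (cm^2):", sigma)
    print("FIT:", fit_rate(sigma))
    print("P(SDC | fault):", sdc_probability(["masked", "sdc", "crash", "masked"]))
```

The beam figure is an absolute rate for the whole device, while the injection figure is a conditional probability given that a fault has already occurred at a chosen site, which is one way to see why the two need to be correlated rather than compared directly.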