Failures in Large Scale Systems: Long-Term Measurement,
Analysis, and Implications
Event Type
Paper
State of the Practice
Time
Wednesday, November 15th, 3:30pm - 4pm
Location
405-406-407
Description
Resilience is one of the key challenges in maintaining
high efficiency of future extreme scale supercomputers.
Researchers and system practitioners rely on field-data
studies to understand reliability characteristics and
plan for future HPC systems. While the complexity of
managing system reliability has increased, the number of
studies covering comprehensive quantification and deep
analysis of failure characteristics in large-scale
systems has not increased in the same proportion. To
bridge this gap, we compare and contrast
the reliability characteristics of multiple large-scale
HPC production systems. Our study covers more than one
billion compute node hours across five different systems
over a period of 8 years. We confirm previous findings
that remain valid, report new findings, and
discuss the implications of the new findings.




