Presentation


Paper: Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications

Session: State of the Practice: Operations

Authors

Saurabh Gupta

Tirthak Patel

Christian Engelmann

Devesh Tiwari

Event Type: Paper


Time: Wednesday, November 15th, 3:30pm - 4pm

Location: 405-406-407

Description: Resilience is one of the key challenges in maintaining the high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. While the complexity of managing system reliability has increased, the number of studies covering comprehensive quantification and deep analysis of failure characteristics in large-scale systems has not increased in the same proportion. To bridge this gap, in this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss implications of new findings.


Authors

Saurabh Gupta (Intel Corporation)

Tirthak Patel (Northeastern University)

Christian Engelmann (Oak Ridge National Laboratory)

Devesh Tiwari (Northeastern University)
