P89: Desh: Deep Learning for HPC System Health Resilience

Session: Poster Reception

Authors

Anwesha Das

Abhinav Vishnu

Charles Siegel

Frank Mueller

Event Type

ACM Student Research Competition

Poster

Reception

Time: Tuesday, November 14th, 5:15pm - 7pm

Location: Four Seasons Ballroom

Description: HPC systems are well known to suffer service downtime as failures increase. Even with advances in HPC architectures and design, enabling resilience remains extremely challenging because of component scaling and the absence of well-defined failure indicators. HPC system logs are notoriously complex and unstructured. Efficient fault prediction that enables proactive recovery mechanisms is needed to make such systems more robust and reliable. This work addresses such faults in computing systems using LSTM (long short-term memory), a technique based on recurrent neural networks.

We present our framework, Desh: Deep Learning for HPC System Health, which entails a procedure to diagnose and predict failures with acceptable lead times. Desh shows promise in identifying failure indicators, with enhanced training and classification, for generic applicability to other systems. This deep learning based framework offers insights for further work on HPC system reliability.
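
For readers unfamiliar with LSTM, the following is a minimal sketch of one LSTM cell step applied to a sequence of log events. It is not the Desh implementation, which the abstract does not describe. The function names (`lstm_step`, `encode_sequence`, `init_params`), the one-hot encoding of log events as integer IDs, and the random, untrained weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """Run one standard LSTM cell step.

    x: input vector, h: previous hidden state, c: previous cell state.
    W[gate] is a (hidden x (len(x) + hidden)) matrix, b[gate] a bias vector,
    for the gates "i" (input), "f" (forget), "o" (output), "g" (candidate).
    """
    z = x + h  # concatenate input and previous hidden state

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W[gate], b[gate])]

    i = [sigmoid(a) for a in affine("i")]
    f = [sigmoid(a) for a in affine("f")]
    o = [sigmoid(a) for a in affine("o")]
    g = [math.tanh(a) for a in affine("g")]

    # New cell state: keep part of the old state, add part of the candidate.
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights, zero biases (untrained)."""
    rng = random.Random(seed)
    W = {gate: [[rng.uniform(-0.1, 0.1)
                 for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
         for gate in "ifog"}
    b = {gate: [0.0] * hidden_size for gate in "ifog"}
    return W, b


def encode_sequence(event_ids, vocab_size, hidden_size, W, b):
    """Feed a sequence of (assumed) integer log-event IDs through the cell."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for event_id in event_ids:
        h, c = lstm_step(one_hot(event_id, vocab_size), h, c, W, b)
    return h


if __name__ == "__main__":
    W, b = init_params(input_size=5, hidden_size=3)
    # Made-up event IDs standing in for parsed log messages.
    print(encode_sequence([0, 3, 3, 1, 4], vocab_size=5, hidden_size=3, W=W, b=b))
```

In a failure-prediction setting, the final hidden state would typically be passed to a trained classifier; training and the choice of log encoding are outside the scope of this sketch.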