P89: Desh: Deep Learning for HPC System Health Resilience
Session: Poster Reception
Event Type: ACM Student Research Competition, Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: HPC systems are well known to suffer service
downtime due to an increasing number of failures. As HPC
architectures and designs advance, enabling resilience is
extremely challenging because of component scaling and
the absence of well-defined failure indicators. HPC system
logs are notoriously complex and unstructured.
Efficient fault prediction that enables proactive recovery
mechanisms is urgently needed to make such systems
more robust and reliable. This work addresses such
faults in computing systems using long short-term
memory (LSTM), a technique based on recurrent neural
networks.
We present our framework, Desh: Deep Learning for HPC System Health, which provides a procedure to diagnose and predict failures with acceptable lead times. Desh shows promise in identifying failure indicators, and with enhanced training and classification it could apply generically to other systems. This deep-learning-based framework offers insights for further work on HPC system reliability.




