Online failure prediction for HPC resources using decentralized clustering

Alejandro Pelaez, Andres Quiroz, James Browne, Thuan Chuah, Manish Parashar

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

13 Citations (Scopus)

Abstract

Ensuring high reliability of large-scale clusters is becoming more critical as these machines continue to grow, since growth increases the complexity and number of interactions between nodes and thus results in a high failure frequency. For this reason, predicting node failures so that errors can be prevented in the first place has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and to use the identified types or patterns as failure predictors at run-time. However, typical centralized solutions of this kind suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft-lockups in large-scale clusters that uses a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach can achieve accuracy similar to that of related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
Original language: English
Title of host publication: 2014 21st International Conference on High Performance Computing (HiPC)
Publisher: IEEE Xplore
ISBN (Electronic): 978-1-4799-5976-1
Publication status: Published - Dec 2014
