Insights into the Diagnosis of System Failures from Cluster Message Logs

Thuan Chuah, Arshad Jhumka, James Browne, Bill Barth, Sai Narasimhamurthy

Research output: Chapter in Book/Report/Conference proceeding, Published conference contribution

9 Citations (Scopus)

Abstract

Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between them, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question in failure diagnosis: what are the characteristics of the errors that lead to cluster system failures? To this end, it presents a systematic process for identifying and characterizing the root causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that (i) failures resulted from recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be put in the public domain in May 2015 to support failure diagnosis for large cluster systems.
Original language: English
Title of host publication: 2015 11th European Dependable Computing Conference (EDCC)
Publisher: IEEE Explore
Publication status: Published - Sept 2015
