Failure Diagnosis for Cluster Systems using Partial Correlations

Thuan Chuah, Arshad Jhumka, Samantha Alt, Richard Evans, Neeraj Suri

Research output: Chapter in Book/Report/Conference proceedingPublished conference contribution

Abstract

Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect. which will be missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been put on the public domain to support system administrators in failure diagnosis.
Original languageEnglish
Title of host publicationIEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
PublisherIEEE Explore
Pages1091-1101
Number of pages11
DOIs
Publication statusPublished - Sep 2021

Cite this