Linking Resource Usage Anomalies with System Failures from Cluster Log Data

Edward Chuah, Arshad Jhumka, Sai Narasimhamurthy, John Hammond, James Browne, Bill Barth

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

33 Citations (Scopus)

Abstract

Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely due to the lack of a comprehensive resource monitoring tool that resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to the Ranger supercomputer, first to identify multiple sources of resource anomalies, then to correlate them with failures recorded in the message logs, and finally to diagnose the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor that resolves resource use by job.
Original language: English
Title of host publication: 2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS)
Publisher: IEEE Explore
Pages: 111-120
Number of pages: 10
Publication status: Published - Sep 2013
