Establishing Hypothesis for Recurrent System Failures from Cluster Log Files

Thuan Chuah, Gary Lee, William-Chandra Tjhi, Shyh-hao Kuo, Terence Hung, John Hammond, Tommy Minyard, James Browne

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

9 Citations (Scopus)

Abstract

A goal for the analysis of supercomputer logs is to establish causal relationships among events which reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps systems administrators must currently use to diagnose system failures. However, supercomputer logs are unstructured, incomplete and contain considerable ambiguity, so that direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of the failure, the components involved and the event sequences which contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events and, from these correlations, determines the components involved and the event sequences. The diagnostics capabilities of FDiag are validated by comparing its assessments on known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostics capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use for support of failure diagnosis on Ranger in September 2011.
Original language: English
Title of host publication: 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC)
Publisher: IEEE Explore
Pages: 15-22
Number of pages: 8
Publication status: Published - 14 Dec 2011

Bibliographical note

Acknowledgements: We would like to thank the Texas Advanced Computing Center for providing the Ranger system logs and case studies. This research was supported in part by the National Science Foundation under OCI award #0622780 to the Texas Advanced Computing Center at the University of Texas at Austin.
