Abstract
A goal of supercomputer log analysis is to
establish causal relationships among events that reflect significant state changes in the system. Establishing these relationships
is at the heart of failure diagnosis. In principle, a log analysis tool
could automate many of the manual steps that systems administrators
must currently perform to diagnose system failures. However, supercomputer logs are unstructured, incomplete, and highly ambiguous, so direct discovery of causal relationships is
difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates
the manual failure diagnosis process and determines, with
high confidence, the likely cause of a failure, the components
involved, and the event sequences that contain the times of the
causal and terminal events. FDiag extracts relevant events from
the system logs, performs correlation analysis on these events, and
from these correlations determines the components involved and
the event sequences. The diagnostic capabilities of FDiag are
validated by comparing its assessments against known instances of
recurrent failures on the Ranger supercomputer at the University
of Texas at Austin. We believe FDiag is the first log analyzer to
demonstrate this level of diagnostic capability from the system
logs of an open-source software stack incorporating Linux and
the Lustre file system. FDiag will be put into production use to
support failure diagnosis on Ranger in September 2011.
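
The three stages named in the abstract (event extraction, correlation analysis, and identification of components and event sequences) can be illustrated with a minimal sketch. This is not FDiag's implementation: the log records, the keyword patterns, the 60-second window, and the use of Pearson correlation on windowed event counts are all assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical log records: (seconds since start, node, message).
LOGS = [
    (5, "oss12", "LustreError: timeout on bulk PUT"),
    (8, "c101", "kernel: soft lockup detected"),
    (65, "oss12", "LustreError: timeout on bulk PUT"),
    (70, "c101", "kernel: soft lockup detected"),
    (130, "c207", "sshd: session opened"),
    (185, "oss12", "LustreError: timeout on bulk PUT"),
    (190, "c101", "kernel: soft lockup detected"),
]

# Hypothetical keywords that mark an event as relevant.
PATTERNS = {"lustre_timeout": "LustreError", "soft_lockup": "soft lockup"}
WINDOW = 60  # seconds per time bin (assumed)


def extract(logs):
    """Stage 1: keep records matching a pattern, tagged with an event type."""
    return [
        (t, node, name)
        for t, node, msg in logs
        for name, needle in PATTERNS.items()
        if needle in msg
    ]


def count_series(events, n_bins):
    """Count occurrences of each event type in each time window."""
    series = defaultdict(lambda: [0] * n_bins)
    for t, _node, name in events:
        series[name][t // WINDOW] += 1
    return dict(series)


def pearson(xs, ys):
    """Stage 2: Pearson correlation between two equal-length count series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def components_and_span(events):
    """Stage 3: nodes seen per event type, plus first and last event times."""
    nodes = defaultdict(set)
    for _t, node, name in events:
        nodes[name].add(node)
    times = sorted(t for t, _node, _name in events)
    return dict(nodes), (times[0], times[-1])


events = extract(LOGS)
n_bins = max(t for t, _, _ in LOGS) // WINDOW + 1
series = count_series(events, n_bins)
r = pearson(series["lustre_timeout"], series["soft_lockup"])
nodes, span = components_and_span(events)
```

In this toy data the two event types always fall in the same windows, so their count series correlate strongly, and the sketch reports the nodes that emitted them together with the time span of the sequence. A correlation of this kind only suggests a relationship; it does not by itself establish which event is causal.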
Original language | English |
---|---|
Title of host publication | 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC) |
Publisher | IEEE Xplore |
Pages | 15-22 |
Number of pages | 8 |
DOIs | |
Publication status | Published - 14 Dec 2011 |
Bibliographical note
Acknowledgements: We would like to thank the Texas Advanced Computing Center for providing the Ranger system logs and case studies.
This research was supported in part by the National Science
Foundation under OCI award #0622780 to the Texas Advanced
Computing Center at the University of Texas at Austin.