Diagnosing the root-causes of failures from cluster log files

Thuan Chuah, Shyh-hao Kuo, Paul Hiew, William-Chandra Tjhi, Gary Lee, John Hammond, Marek Michalewicz, Terence Hung, James Browne

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

33 Citations (Scopus)

Abstract

System event logs are often the primary source
of information for diagnosing (and predicting) the causes of
failures for cluster systems. Due to interactions among the
system hardware and software components, the system event
logs for large cluster systems consist of streams of
interleaved events, and only a small fraction of the events over
a small time span are relevant to the diagnosis of a given
failure. Furthermore, the process of troubleshooting the causes of
failures is largely manual and ad-hoc. In this paper, we present
a systematic methodology for reconstructing event order and
establishing correlations among events which indicate the root-causes
of a given failure from very large syslogs. We developed
a diagnostics tool, FDiag, which extracts the log entries as structured
message templates and uses statistical correlation analysis to
establish probable cause and effect relationships for the fault
being analyzed. We applied FDiag to analyze failures due to
breakdowns in interactions between the Lustre file system and
its clients on the Ranger supercomputer at the Texas Advanced
Computing Center (TACC). The results are positive. FDiag is
able to identify the dates and the time periods that contain
the significant events which eventually led to the occurrence of
compute node soft lockups.
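
The two steps named in the abstract, reducing log entries to message templates and correlating events with a failure, can be illustrated with a minimal sketch. This is not FDiag's implementation: the masking rules, the fixed time windows, the choice of Pearson correlation, and every function name below are assumptions made only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def to_template(message):
    """Mask variable fields so that similar messages share one template.

    The two masking rules are illustrative assumptions, not FDiag's rules.
    """
    message = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)  # hex before plain digits
    return re.sub(r"\d+", "<NUM>", message)


def counts_per_window(events, window_seconds):
    """Count occurrences of each template in fixed-width time windows.

    events: iterable of (timestamp_in_seconds, message) pairs.
    Returns {template: Counter({window_index: count})}.
    """
    series = defaultdict(Counter)
    for timestamp, message in events:
        series[to_template(message)][int(timestamp // window_seconds)] += 1
    return series


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlate_with_failure(events, failure_template, window_seconds=3600):
    """Correlate every template's per-window counts with the failure template's counts."""
    series = counts_per_window(events, window_seconds)
    if not series:
        return {}
    all_windows = [w for counts in series.values() for w in counts]
    # Include empty windows between the first and last event.
    windows = range(min(all_windows), max(all_windows) + 1)
    failure_counts = series.get(failure_template, Counter())
    target = [failure_counts[w] for w in windows]
    return {
        template: pearson([counts[w] for w in windows], target)
        for template, counts in series.items()
        if template != failure_template
    }
```

Calling `correlate_with_failure` with a list of `(timestamp, message)` pairs and the template of the failure message returns one score per other template; templates that tend to appear in the same windows as the failure score close to 1. A high score only marks a candidate for inspection; it does not establish cause and effect.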

Original language: English
Title of host publication: 2010 IEEE International Conference on High Performance Computing (HiPC)
Publisher: IEEE Explore
Pages: 1-10
Number of pages: 10
Publication status: Published - 19 Dec 2010
