TY - GEN
T1 - Linking Resource Usage Anomalies with System Failures from Cluster Log Data
AU - Chuah, Edward
AU - Jhumka, Arshad
AU - Narasimhamurthy, Sai
AU - Hammond, John
AU - Browne, James
AU - Barth, Bill
N1 - Acknowledgements: We thank the Texas Advanced Computing Center (TACC) for providing the Ranger message logs and resource use data, and Malcolm Muggeridge (Xyratex) for granting access to his researchers. This research was supported in part by the National Science Foundation under OCI award #0622780 and #1203604 to TACC at the University of Texas at Austin.
PY - 2013/9
Y1 - 2013/9
AB - Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage on system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system that applies TACC_Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Application of ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs and diagnose the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.
UR - http://dx.doi.org/10.1109/srds.2013.20
U2 - 10.1109/srds.2013.20
DO - 10.1109/srds.2013.20
M3 - Published conference contribution
SP - 111
EP - 120
BT - 2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS)
PB - IEEE
ER -