CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems

Nentawe Gurumdimma, Arshad Jhumka, Maria Liakata, Thuan Chuah, James Browne

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

22 Citations (Scopus)

Abstract

The use of console logs for error detection in large-scale distributed systems has proven useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. To increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and the console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We evaluate our approach using console logs and resource usage data from the Ranger Supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, it provides an average improvement of around 85%, with a best-case improvement of 250%.
Original language: English
Title of host publication: 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)
Publisher: IEEE Explore
ISBN (Electronic): 978-1-5090-3513-7
Publication status: Published - Sept 2016
