An Empirical Study of Major Page Faults for Failure Diagnosis in Cluster Systems

Edward Chuah, Arshad Jhumka, Sai Narasimhamurthy

Research output: Contribution to journal › Article › peer-review


High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is often advocated as a basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms: the LASSO, Ridge and Elastic Net regression techniques. To the best of our knowledge, no prior work has studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters that are correlated with major page faults and the system events that are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
Original language: English
Number of pages: 35
Journal: Journal of Supercomputing
Publication status: Published - 15 May 2023


  • large cluster systems
  • major page faults
  • system logs
  • resource use data
  • regression analysis

