Using resource use data and system logs for HPC system error propagation and recovery diagnosis

Edward Chuah, Arshad Jhumka, Samantha Alt, J. J. Villalobos, Joshua Fryman, William Barth, Manish Parashar

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

Analyzing failures is important for the reliability of HPC systems, and failure diagnosis based only on system logs is incomplete. Resource use data, made available recently, is another potential source of data for failure analysis. Recent work that combines analysis of system logs with resource use data shows promising results. In this paper, we describe a new workflow for combining system resource usage and failure logs for diagnosis. The workflow, called EXERMEST, identifies significant system counters and events, then correlates them to failures and recovery. We apply EXERMEST to the log data of the Ranger HPC cluster and show that it improves diagnosis over previous research. EXERMEST: (i) shows that more system counters and errors can be identified only by applying more feature extractors, (ii) identifies CPU I/O bottlenecks and Lustre client eviction, (iii) identifies network packet drops and Lustre I/O errors, (iv) identifies virtual memory and hard disk I/O errors, and (v) shows that time-bins of different granularities are required for identifying the errors. EXERMEST is available in the public domain to support system administrators in failure diagnosis.
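The point about time-bins can be illustrated with a minimal sketch. This is not the EXERMEST implementation: the function names (`bin_counts`, `pearson`, `correlate_at`), the use of Pearson correlation, and the timestamps are all assumptions made for illustration. The sketch counts counter-spike events and failure events per fixed-width time-bin and correlates the two series at two bin widths.

```python
from collections import Counter
from math import sqrt

def bin_counts(timestamps, bin_seconds, n_bins):
    """Count events per fixed-width time-bin (timestamps are seconds from start)."""
    counts = Counter(int(t // bin_seconds) for t in timestamps)
    return [counts.get(i, 0) for i in range(n_bins)]

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlate_at(counter_ts, failure_ts, bin_seconds, horizon_seconds):
    """Bin both event streams at one granularity and correlate the bin counts."""
    n_bins = -(-horizon_seconds // bin_seconds)  # ceiling division
    return pearson(
        bin_counts(counter_ts, bin_seconds, n_bins),
        bin_counts(failure_ts, bin_seconds, n_bins),
    )

# Synthetic (hypothetical) data: each failure follows a counter spike by 40 s.
COUNTER_SPIKES = [10, 130, 250]
FAILURES = [50, 170, 290]

if __name__ == "__main__":
    for width in (30, 60):
        r = correlate_at(COUNTER_SPIKES, FAILURES, width, 360)
        print(f"bin width {width:>2} s: r = {r:+.3f}")
```

With this synthetic data, 60-second bins place each spike and its failure in the same bin, so the series correlate strongly, while 30-second bins separate them and the correlation turns negative. This is one way a relationship can be visible at one granularity and hidden at another.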

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking, ISPA/BDCloud/SustainCom/SocialCom 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 458-467
Number of pages: 10
ISBN (Electronic): 9781728143286
State: Published - Dec 2019
Externally published: Yes
Event: 17th IEEE International Conference on Parallel and Distributed Processing with Applications, 9th IEEE International Conference on Big Data and Cloud Computing, 9th IEEE International Conference on Sustainable Computing and Communications, 12th IEEE Inte... - Xiamen, China
Duration: Dec 16 2019 to Dec 18 2019

Publication series

Name: Proceedings - 2019 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking, ISPA/BDCloud/SustainCom/SocialCom 2019

Conference

Conference: 17th IEEE International Conference on Parallel and Distributed Processing with Applications, 9th IEEE International Conference on Big Data and Cloud Computing, 9th IEEE International Conference on Sustainable Computing and Communications, 12th IEEE Inte...
Country/Territory: China
City: Xiamen
Period: 12/16/19 to 12/18/19

All Science Journal Classification (ASJC) codes

  • Information Systems and Management
  • Communication
  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computer Science Applications
  • Renewable Energy, Sustainability and the Environment

Keywords

  • Correlation
  • Error propagation and recovery
  • Feature extraction
  • HPC
  • Resource use data and system logs
