Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

Bo Fang, Daoce Wang, Sian Jin, Quincey Koziol, Zhao Zhang, Qiang Guan, Suren Byna, Sriram Krishnamoorthy, Dingwen Tao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In recent years, the increasing complexity in scientific simulations and emerging demands for training heavy artificial intelligence models require massive and fast data accesses, which urges high-performance computing (HPC) platforms to equip with more advanced storage infrastructures such as solidstate disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the HPC applications under the SSD-related failures remains unclear, in particular for failures resulting in data corruptions. The goal of this paper is to understand the impact of SSD-related faults on the behaviors of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model the errors originated from SSDs. FFIS is able to plant different I/O related faults into the data returned from underlying file systems, which enables the investigation on the error resilience characteristics of the scientific file format. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to the data corruptions, and provide insights on the error resilience of the widely adopted HDF5 file format for the HPC applications.

Original languageEnglish (US)
Title of host publicationProceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages409-420
Number of pages12
ISBN (Electronic)9781728196664
DOIs
StatePublished - 2021
Externally publishedYes
Event2021 IEEE International Conference on Cluster Computing, Cluster 2021 - Virtual, Portland, United States
Duration: Sep 7 2021Sep 10 2021

Publication series

NameProceedings - IEEE International Conference on Cluster Computing, ICCC
Volume2021-September

Conference

Conference2021 IEEE International Conference on Cluster Computing, Cluster 2021
Country/TerritoryUnited States
CityVirtual, Portland
Period9/7/219/10/21

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing

Fingerprint

Dive into the research topics of 'Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights'. Together they form a unique fingerprint.

Cite this