TY - GEN
T1 - Characterizing Impacts of Storage Faults on HPC Applications
T2 - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
AU - Fang, Bo
AU - Wang, Daoce
AU - Jin, Sian
AU - Koziol, Quincey
AU - Zhang, Zhao
AU - Guan, Qiang
AU - Byna, Suren
AU - Krishnamoorthy, Sriram
AU - Tao, Dingwen
N1 - Publisher Copyright: ©2021 IEEE.
PY - 2021
Y1 - 2021
N2 - In recent years, the increasing complexity of scientific simulations and the emerging demand for training heavy artificial intelligence models have required massive and fast data accesses, which pushes high-performance computing (HPC) platforms to adopt more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by HPC applications under SSD-related failures remain unclear, in particular for failures resulting in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model errors originating from SSDs. FFIS is able to plant different I/O-related faults into the data returned from underlying file systems, which enables the investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of the widely adopted HDF5 file format for HPC applications.
AB - In recent years, the increasing complexity of scientific simulations and the emerging demand for training heavy artificial intelligence models have required massive and fast data accesses, which pushes high-performance computing (HPC) platforms to adopt more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by HPC applications under SSD-related failures remain unclear, in particular for failures resulting in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model errors originating from SSDs. FFIS is able to plant different I/O-related faults into the data returned from underlying file systems, which enables the investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of the widely adopted HDF5 file format for HPC applications.
UR - http://www.scopus.com/inward/record.url?scp=85126056790&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126056790&partnerID=8YFLogxK
U2 - https://doi.org/10.1109/Cluster48925.2021.00048
DO - https://doi.org/10.1109/Cluster48925.2021.00048
M3 - Conference contribution
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 409
EP - 420
BT - Proceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 7 September 2021 through 10 September 2021
ER -