Large computing clusters, including data centers and supercomputers, are used for a variety of applications including scientific computations and machine learning. Modern compute clusters typically use specialized accelerator hardware to speed up computations. Operators of accelerator-rich clusters aim to have high resource utilization across all users of the cluster. However, these systems are often under-utilized due to performance variability across accelerators; that is, application performance varies across accelerators even when the same application is run on the same type of accelerator. This proposal will develop Fortuna, a set of tools that can be used by cluster operators and researchers to characterize and harness variability across accelerators. First, Fortuna will use new methodologies to characterize how much performance variability exists across a wide range of accelerator hardware. Second, Fortuna will identify which applications are more likely to suffer from performance variability. Finally, Fortuna will include new scheduling mechanisms that can use variability measurements and knowledge about applications to improve utilization.

Broader impacts of the proposed research include open-source implementations of algorithms and tools, which will be applicable to many large-scale clusters and lay the groundwork for wider industry adoption. The project will also create course modules on system design principles with heterogeneous hardware and software, based on the tools developed as a part of the proposal. This will teach the next generation of students how to design hardware and software to improve utilization of future systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Effective start/end date: 10/1/23 → 9/30/26
- National Science Foundation: $333,105.00