Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers

Project Details

Description

Today's deep learning (DL) revolution is enabled by efficient deep neural network (DNN) training methods that capture important patterns within large quantities of data in compact, easily usable DNN models. DL methods are applied routinely to tasks like natural language translation and image labeling, and, in science and engineering, to problems as diverse as drug design, environmental monitoring, and fusion energy. Yet as data sizes increase and DL methods grow in sophistication, the time required to train new models often emerges as a major challenge.

The Scalable Deep Learning (ScaDL) project will address this challenge by making it possible to use specialized high-performance computing (HPC) systems to train bigger models more rapidly. Efficient use of the thousands of powerful processors in modern HPC systems for DNN training has previously been stymied by communication costs that grow rapidly with the number of processors used. ScaDL will overcome this obstacle by developing new DNN training methods that reduce communication requirements by performing additional computation, by validating the effectiveness of these new methods in a range of scientific applications that use DL in different ways, and by integrating the new methods into scalable DL software for use by domain scientists, computer scientists, and engineers supporting DL applications at HPC centers. By permitting the use of powerful HPC systems to train DNN models thousands of times faster than on a single computer, ScaDL will enable advances in many areas of science and engineering.

The project will also contribute to educational outcomes by engaging PhD students in project research, by using ScaDL tools in a new DL systems engineering class at the University of Chicago, and by enlisting participants in summer schools at the Texas Advanced Computing Center (TACC) and the University of Chicago, both of which recruit students from underserved communities at the graduate, undergraduate, and high-school levels, to apply the tools to scientific problems. ScaDL's focus on science applications and education aligns the project with NSF's mission of promoting the progress of science.
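The idea of reducing communication by performing additional computation can be illustrated with local SGD, in which each worker takes several optimization steps on its own data shard between synchronizations, cutting communication rounds by the same factor. The NumPy sketch below is a minimal, self-contained simulation of that trade-off; it is an illustrative assumption, not necessarily the specific algorithm ScaDL develops, and all names in it (make_worker_data, local_sgd, and so on) are hypothetical.

```python
# Minimal sketch of trading computation for communication: each worker
# takes several local gradient steps between synchronizations, so only
# one communication (averaging) round occurs per `local_steps` updates.
# Illustrative only; not ScaDL's actual algorithm.
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_true = rng.normal(size=d)

def make_worker_data(n=256):
    """Synthetic linear-regression shard for one worker."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y

def grad(w, X, y):
    """Gradient of mean squared error for the linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def local_sgd(num_workers=4, rounds=50, local_steps=4, lr=0.05):
    """Run `rounds` communication rounds; each worker performs
    `local_steps` gradient steps between synchronizations, so the
    number of communication rounds is cut by a factor of `local_steps`
    relative to synchronizing after every step."""
    shards = [make_worker_data() for _ in range(num_workers)]
    w = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for X, y in shards:
            w_k = w.copy()
            for _ in range(local_steps):      # extra local computation
                w_k -= lr * grad(w_k, X, y)
            local_models.append(w_k)
        w = np.mean(local_models, axis=0)     # one communication round
    return w

w = local_sgd()
print(f"parameter error: {np.linalg.norm(w - w_true):.4f}")
```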

The ScaDL project contributes to science in two ways. First, it explores new techniques for enhancing the speed and scalability of commonly used optimization methods without sacrificing model performance, by: 1) exploiting scalable algorithms for approximating second-order information; 2) developing methods for adapting to different computer hardware by tuning computation and communication to maximize training speed; 3) exploring compression techniques to reduce communication overheads; 4) using well-known benchmark applications to evaluate the convergence of ScaDL's methods; and 5) applying its new algorithms and systems to science applications. Second, it will release an open-source implementation of the proposed algorithms and system. The implementation will run on a variety of hardware platforms and will be able to choose the ratio of computation to communication that makes efficient use of a particular HPC system's processing and network hardware. The resulting algorithms and system will help disseminate ScaDL research results to a wide spectrum of research domains and users, and promote the adoption of the new methods in practical settings.
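To make the first technique concrete: a common scalable approach to second-order approximation is Kronecker factorization (K-FAC-style), in which a layer's curvature (Fisher) matrix is approximated by the Kronecker product of two small covariance factors. Those factors are far smaller than the exact Fisher, so they are cheap to compute, invert, and communicate between processors. The NumPy sketch below illustrates the idea for a single linear layer; it is a hedged illustration under that assumption, not ScaDL's actual implementation, and the dimensions and damping value are arbitrary.

```python
# K-FAC-style preconditioning for one linear layer (NumPy sketch).
# Instead of forming the exact Fisher over all d_out * d_in parameters,
# the curvature is approximated as a Kronecker product A (x) G of two
# small covariance factors. Hypothetical illustration, not ScaDL's code.
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out = 1024, 512, 256

X = rng.normal(size=(n, d_in))     # layer inputs (activations)
E = rng.normal(size=(n, d_out))    # backpropagated output gradients

dW = E.T @ X / n                   # ordinary gradient, shape (d_out, d_in)

damping = 1e-2                     # Tikhonov damping keeps factors invertible
A = X.T @ X / n + damping * np.eye(d_in)    # activation covariance
G = E.T @ E / n + damping * np.eye(d_out)   # output-gradient covariance

# Preconditioned update: vec(dW_pre) = (A (x) G)^{-1} vec(dW),
# computed cheaply with the small factors as dW_pre = G^{-1} dW A^{-1}.
dW_pre = np.linalg.solve(G, dW) @ np.linalg.inv(A)

full_entries = (d_in * d_out) ** 2        # entries in the exact Fisher
factor_entries = d_in ** 2 + d_out ** 2   # entries in the two factors
print(f"exact Fisher entries: {full_entries:,}")
print(f"Kronecker factor entries: {factor_entries:,}")
```

For the dimensions above, the exact Fisher would have roughly 1.7 x 10^10 entries, while the two factors together have about 3.3 x 10^5; savings of this kind are what make second-order methods practical, and their factors cheap to communicate, at HPC scale.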

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status: Active
Effective start/end date: 10/1/21 – 9/30/24

Funding

  • National Science Foundation: $226,442.00
