Novel submission modes for tightly coupled jobs across distributed resources for reduced time-to-solution

Promita Chakraborty, Shantenu Jha, Daniel S. Katz

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The problems of scheduling a single parallel job across a large-scale distributed system are well known and surprisingly difficult to solve. In addition, because of the issues involved in distributed submission, such as co-reserving resources and managing accounts and certificates simultaneously on multiple machines, the vast majority of high-performance computing (HPC) application users have been happy to remain restricted to submitting jobs to single machines. Meanwhile, the need to simulate larger and more complex physical systems continues to grow, with a concomitant increase in the number of cores required to solve the resulting scientific problems. In such circumstances, one might reduce the load demanded of each machine, and eventually the wait time in the queue, by decomposing the problem to use two resources, even though there might be a reduction in the peak performance. This motivates a question: can otherwise monolithic jobs running on single resources be distributed over more than one machine such that there is an overall reduction in the time-to-solution? In this paper, we briefly discuss the development and performance of a parallel molecular dynamics code and its generalization to work on multiple distributed machines (using MPICH-G2). We benchmark and validate the performance of our simulations over multiple input datasets of varying sizes. The primary aim of this work, however, is to show that the time-to-solution can be reduced by sacrificing some peak performance and distributing over multiple machines.

Original language: English (US)
Pages (from-to): 2545-2556
Number of pages: 12
Journal: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Volume: 367
Issue number: 1897
State: Published - Jun 28 2009
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Engineering (all)
  • Physics and Astronomy (all)
  • Mathematics (all)

Keywords

  • Job submission paradigm
  • Scheduling
  • Tightly coupled distributed performance
