Autotuning divide-and-conquer stencil computations

Ekanathan Palamadai Natarajan, Charles Leiserson

Research output: Contribution to journalArticle

1 Citation (Scopus)

Abstract

This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties—space-time equivalence, divide subsumption, and favored dimension—that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state-of-the-art heuristic autotuner called OpenTuner in tuning the divide-and-conquer algorithm used in Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5% –12% faster on average, and the OpenTuner tuned code ran from 9% slower to 2% faster on average, than Pochoir's default code. In the best case, the Ztuned code ran 40% faster, and the OpenTuner tuned code ran 33% faster than Pochoir's code. Whereas the autotuning time of Ztune for each benchmark could be measured in minutes, to achieve comparable results, the autotuning time of OpenTuner was typically measured in hours or days. Surprisingly, for some benchmarks, Ztune actually autotuned faster than the time it takes to perform the stencil computation once.

Original languageEnglish (US)
Article numbere4127
JournalConcurrency Computation
Volume29
Issue number17
DOIs
StatePublished - Sep 10 2017

Fingerprint

Auto-tuning
Divide and conquer
Benchmark
Tuning
Hardware
Heuristics
Divide-and-conquer Algorithm
Pruning
Compiler
Divides
Efficacy
Equivalence
Configuration

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Computer Networks and Communications
  • Computer Science Applications
  • Computational Theory and Mathematics

Keywords

  • autotuning
  • divide-and-conquer
  • stencil computations
  • trapezoidal decomposition

Cite this

Palamadai Natarajan, E., & Leiserson, C. (2017). Autotuning divide-and-conquer stencil computations. Concurrency Computation, 29(17), [e4127]. https://doi.org/10.1002/cpe.4127
Palamadai Natarajan, Ekanathan ; Leiserson, Charles. / Autotuning divide-and-conquer stencil computations. In: Concurrency Computation. 2017 ; Vol. 29, No. 17.
@article{02526a67351e4c18ba55878db624ba61,
title = "Autotuning divide-and-conquer stencil computations",
abstract = "This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties—space-time equivalence, divide subsumption, and favored dimension—that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state-of-the-art heuristic autotuner called OpenTuner in tuning the divide-and-conquer algorithm used in Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5{\%} –12{\%} faster on average, and the OpenTuner tuned code ran from 9{\%} slower to 2{\%} faster on average, than Pochoir's default code. In the best case, the Ztuned code ran 40{\%} faster, and the OpenTuner tuned code ran 33{\%} faster than Pochoir's code. Whereas the autotuning time of Ztune for each benchmark could be measured in minutes, to achieve comparable results, the autotuning time of OpenTuner was typically measured in hours or days. Surprisingly, for some benchmarks, Ztune actually autotuned faster than the time it takes to perform the stencil computation once.",
keywords = "autotuning, divide-and-conquer, stencil computations, trapezoidal decomposition",
author = "Ekanathan Palamadai Natarajan and Charles Leiserson",
year = "2017",
month = "9",
day = "10",
doi = "https://doi.org/10.1002/cpe.4127",
language = "English (US)",
volume = "29",
journal = "Concurrency Computation Practice and Experience",
issn = "1532-0626",
publisher = "John Wiley and Sons Ltd",
number = "17",

}

Palamadai Natarajan, E & Leiserson, C 2017, 'Autotuning divide-and-conquer stencil computations', Concurrency Computation, vol. 29, no. 17, e4127. https://doi.org/10.1002/cpe.4127

Autotuning divide-and-conquer stencil computations. / Palamadai Natarajan, Ekanathan; Leiserson, Charles.

In: Concurrency Computation, Vol. 29, No. 17, e4127, 10.09.2017.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Autotuning divide-and-conquer stencil computations

AU - Palamadai Natarajan, Ekanathan

AU - Leiserson, Charles

PY - 2017/9/10

Y1 - 2017/9/10

N2 - This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties—space-time equivalence, divide subsumption, and favored dimension—that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state-of-the-art heuristic autotuner called OpenTuner in tuning the divide-and-conquer algorithm used in Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5% –12% faster on average, and the OpenTuner tuned code ran from 9% slower to 2% faster on average, than Pochoir's default code. In the best case, the Ztuned code ran 40% faster, and the OpenTuner tuned code ran 33% faster than Pochoir's code. Whereas the autotuning time of Ztune for each benchmark could be measured in minutes, to achieve comparable results, the autotuning time of OpenTuner was typically measured in hours or days. Surprisingly, for some benchmarks, Ztune actually autotuned faster than the time it takes to perform the stencil computation once.

AB - This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties—space-time equivalence, divide subsumption, and favored dimension—that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We compared the performance of Ztune with that of a state-of-the-art heuristic autotuner called OpenTuner in tuning the divide-and-conquer algorithm used in Pochoir stencil compiler. Over a nightly run on ten application benchmarks across two machines with different hardware configurations, the Ztuned code ran 5% –12% faster on average, and the OpenTuner tuned code ran from 9% slower to 2% faster on average, than Pochoir's default code. In the best case, the Ztuned code ran 40% faster, and the OpenTuner tuned code ran 33% faster than Pochoir's code. Whereas the autotuning time of Ztune for each benchmark could be measured in minutes, to achieve comparable results, the autotuning time of OpenTuner was typically measured in hours or days. Surprisingly, for some benchmarks, Ztune actually autotuned faster than the time it takes to perform the stencil computation once.

KW - autotuning

KW - divide-and-conquer

KW - stencil computations

KW - trapezoidal decomposition

UR - http://www.scopus.com/inward/record.url?scp=85027346399&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85027346399&partnerID=8YFLogxK

U2 - https://doi.org/10.1002/cpe.4127

DO - https://doi.org/10.1002/cpe.4127

M3 - Article

VL - 29

JO - Concurrency Computation Practice and Experience

JF - Concurrency Computation Practice and Experience

SN - 1532-0626

IS - 17

M1 - e4127

ER -