Storage-aware task scheduling for performance optimization of big data workflows

Qianwen Ye, Chase Wu, Huiyan Cao, Nageswara S.V. Rao, Aiqin Hou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many large-scale applications in various domains are generating big data, which are increasingly processed and analyzed by MapReduce-based workflows deployed in Hadoop systems. In addition to computing time, the makespan of such data-intensive workflows is also largely affected by communication cost. Particularly, there are two levels of data movement during the execution of distributed workflows in Hadoop: i) from map tasks to reduce tasks within each individual MapReduce module and ii) between each pair of adjacent modules in the workflow. Traditionally, these two aspects of network traffic have been treated separately as data locality at the task and module or job level, respectively. However, the interactions between these two levels of data movement may create complicated dynamics and their compound effects remain largely unexplored. In this paper, we formulate a task scheduling problem that considers data movement at both levels to minimize the end-to-end delay of a MapReduce-based workflow. We show this problem to be NP-complete, and design a storage-aware big data workflow scheduling algorithm, referred to as SA-BWS, to optimize workflow performance in Hadoop environments. The performance superiority of SA-BWS is illustrated by extensive simulations in comparison with the default workflow engine in Hadoop and existing scheduling methods.
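The interaction between the two levels of data movement can be illustrated with a toy calculation. The sketch below is not SA-BWS and is not taken from the paper. It assumes a hypothetical two-module chain (tasks `m1`, `r1`, `m2`, `r2`), two hypothetical nodes (`n1`, `n2`), and made-up data sizes, and it only counts the data that must cross nodes at each level for a given task placement.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source task, destination task, data size in MB)

# Hypothetical workflow: module 1 (m1 -> r1) feeds module 2 (m2 -> r2).
# Level i): map-to-reduce shuffle inside each MapReduce module.
SHUFFLE_EDGES: List[Edge] = [("m1", "r1", 100), ("m2", "r2", 50)]
# Level ii): output of one module read by the adjacent module.
MODULE_EDGES: List[Edge] = [("r1", "m2", 80)]


def remote_volume(edges: List[Edge], placement: Dict[str, str]) -> int:
    """Sum the data sizes of edges whose endpoints sit on different nodes."""
    return sum(size for src, dst, size in edges if placement[src] != placement[dst])


def movement(placement: Dict[str, str]) -> Tuple[int, int]:
    """Return (intra-module remote MB, inter-module remote MB) for a placement."""
    return (
        remote_volume(SHUFFLE_EDGES, placement),
        remote_volume(MODULE_EDGES, placement),
    )


# Placement A keeps each module's map and reduce tasks together.
placement_a = {"m1": "n1", "r1": "n1", "m2": "n2", "r2": "n2"}
# Placement B instead co-locates r1 with the next module's map task.
placement_b = {"m1": "n1", "r1": "n2", "m2": "n2", "r2": "n2"}

print(movement(placement_a))  # (0, 80)
print(movement(placement_b))  # (100, 0)
```

In this made-up instance, the placement that eliminates shuffle traffic forces the inter-module transfer across nodes, and vice versa, which is the kind of coupling that motivates considering both levels in one scheduling problem.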

Original language: English (US)
Title of host publication: Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018
Editors: Jinjun Chen, Laurence T. Yang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1095-1102
Number of pages: 8
ISBN (Electronic): 9781728111414
DOIs: https://doi.org/10.1109/BDCloud.2018.00163
State: Published - Mar 20 2019
Event: 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018 - Melbourne, Australia
Duration: Dec 11 2018 to Dec 13 2018

Publication series

Name: Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018

Conference

Conference: 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018
Country: Australia
City: Melbourne
Period: 12/11/18 to 12/13/18

Fingerprint

Scheduling
Scheduling algorithms
Engines
Communication
Costs
Big data

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Computational Theory and Mathematics

Keywords

  • Big data workflow
  • Data locality
  • MapReduce
  • Workflow optimization
  • Workflow scheduling

Cite this

Ye, Q., Wu, C., Cao, H., Rao, N. S. V., & Hou, A. (2019). Storage-aware task scheduling for performance optimization of big data workflows. In J. Chen & L. T. Yang (Eds.), Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018 (pp. 1095-1102). [8672241] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BDCloud.2018.00163