TY - GEN
T1 - Performance Modeling and Prediction of Big Data Workflows
T2 - 29th International Conference on Computer Communications and Networks, ICCCN 2020
AU - Liu, Wuji
AU - Wu, Chase Q.
AU - Ye, Qianwen
AU - Hou, Aiqin
AU - Shen, Wei
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/8
Y1 - 2020/8
N2 - Many next-generation scientific and business applications feature large-scale data-intensive workflows, which require massive computing resources for execution on high-performance clusters in cloud environments. Such computing resources (e.g., VCores and virtual memory) requested through parameter setting in big data systems, if not fully utilized by workloads, are simply wasted due to the nature of exclusive access made possible by containerization. This necessitates accurate modeling and prediction of workflow performance to make an effective recommendation of appropriate parameter settings to end users. However, it is challenging to determine optimal workflow and system configurations due to the large parameter space and the interaction between various technology layers of big data systems. Towards this goal, we propose a machine learning-based feature selection method to identify influential parameters based on historical performance measurements of Spark-based computing workloads executed in big data systems with YARN. We first identify a comprehensive set of parameters across multiple layers in the big data technology stack including workflow input structure, Spark computing engine, and YARN resource management. We then conduct an in-depth exploratory analysis of their individual and coupled impact on workflow performance, and develop a performance-influence model using random forest for prediction. Experimental results show that the proposed approach identifies important features for performance modeling and achieves high accuracy in performance prediction.
KW - Big data workflows
KW - Spark
KW - machine learning
KW - performance modeling and prediction
KW - representation learning
UR - http://www.scopus.com/inward/record.url?scp=85093855591&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093855591&partnerID=8YFLogxK
U2 - 10.1109/ICCCN49398.2020.9209715
DO - 10.1109/ICCCN49398.2020.9209715
M3 - Conference contribution
T3 - Proceedings - International Conference on Computer Communications and Networks, ICCCN
BT - ICCCN 2020 - 29th International Conference on Computer Communications and Networks
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 August 2020 through 6 August 2020
ER -