TY - JOUR
T1 - Minimizing Training Time of Distributed Machine Learning by Reducing Data Communication
AU - Duan, Yubin
AU - Wang, Ning
AU - Wu, Jie
N1 - Funding Information: Manuscript received January 17, 2021; revised March 22, 2021; accepted April 13, 2021. Date of publication April 16, 2021; date of current version July 7, 2021. This research was supported in part by NSF grants CNS 1824440, CNS 1828363, CNS 1757533, CNS 1629746, CNS 1651947, and CNS 1564128. Recommended for acceptance by Dr. Celimuge Wu. (Corresponding author: Ning Wang.) Yubin Duan and Jie Wu are with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122 USA (e-mail: yubin.duan@temple.edu; jiewu@temple.edu). Publisher Copyright: © 2013 IEEE.
PY - 2021/4/1
Y1 - 2021/4/1
N2 - Due to the additive property of most machine learning objective functions, training can be distributed across multiple machines. Distributed machine learning is an efficient way to handle the rapid growth of data volume, at the cost of extra inter-machine communication. One common implementation is the parameter server system, which contains two types of nodes: worker nodes, which compute updates, and server nodes, which maintain parameters. We observe that inefficient communication between workers and servers may slow down the system. Therefore, we formulate a graph partition problem that partitions data among workers and parameters among servers such that the total training time is minimized. The problem is NP-complete. We investigate a two-step heuristic approach that first partitions data and then partitions parameters, and we consider the trade-off between partitioning time and the resulting saving in training time. In addition, we adapt a multilevel graph partitioning approach to the bipartite graph partitioning problem. We implement both approaches on an open-source parameter server platform, PS-lite. Experimental results on synthetic and real-world datasets show that both approaches improve communication efficiency by up to 14 times compared with random partitioning.
AB - Due to the additive property of most machine learning objective functions, training can be distributed across multiple machines. Distributed machine learning is an efficient way to handle the rapid growth of data volume, at the cost of extra inter-machine communication. One common implementation is the parameter server system, which contains two types of nodes: worker nodes, which compute updates, and server nodes, which maintain parameters. We observe that inefficient communication between workers and servers may slow down the system. Therefore, we formulate a graph partition problem that partitions data among workers and parameters among servers such that the total training time is minimized. The problem is NP-complete. We investigate a two-step heuristic approach that first partitions data and then partitions parameters, and we consider the trade-off between partitioning time and the resulting saving in training time. In addition, we adapt a multilevel graph partitioning approach to the bipartite graph partitioning problem. We implement both approaches on an open-source parameter server platform, PS-lite. Experimental results on synthetic and real-world datasets show that both approaches improve communication efficiency by up to 14 times compared with random partitioning.
UR - http://www.scopus.com/inward/record.url?scp=85104675336&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104675336&partnerID=8YFLogxK
U2 - 10.1109/TNSE.2021.3073897
DO - 10.1109/TNSE.2021.3073897
M3 - Article
SN - 2327-4697
VL - 8
SP - 1802
EP - 1814
JO - IEEE Transactions on Network Science and Engineering
JF - IEEE Transactions on Network Science and Engineering
IS - 2
M1 - 9406385
ER -