Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1649-1655. DOI: 10.11772/j.issn.1001-9081.2021061404

• Paper from the 2021 China National Conference on Open Distributed and Parallel Computing (DPCS 2021) •


Performance interference analysis and prediction for distributed machine learning jobs

Hongliang LI1,2, Nong ZHANG1, Ting SUN1, Xiang LI1

  1. College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China
    2. Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (Jilin University), Changchun, Jilin 130012, China
  • Received:2021-08-05 Revised:2021-10-14 Accepted:2021-10-20 Online:2022-06-22 Published:2022-06-10
  • Contact: Xiang LI
  • About author: LI Hongliang, born in 1983, Ph. D., associate professor. His research interests include distributed computing and high performance computing.
    ZHANG Nong, born in 1999, M. S. candidate. His research interests include distributed computing and high performance computing.
    SUN Ting, born in 1996, M. S. Her research interests include cloud computing and high performance computing.
  • Supported by:
    National Key Research and Development Program of China(2017YFC1502306);National Natural Science Foundation of China(61602205)


Abstract:

By analyzing job performance interference in distributed machine learning, it was found that performance interference arises from uneven allocation of GPU resources, such as memory overload and bandwidth competition. Therefore, a mechanism for quickly predicting performance interference between jobs was designed and implemented, which can adaptively predict the degree of job interference according to given GPU parameters and job types. First, the GPU parameters and interference rates of running distributed machine learning jobs were obtained through experiments, and the influence of each parameter on performance interference was analyzed. Second, GPU parameter-interference rate models were built by using multiple prediction technologies, and their job interference rate errors were analyzed. Finally, an adaptive job interference rate prediction algorithm was proposed to automatically select the prediction model with the smallest error for a given device environment and job set, so as to predict job interference rates quickly and accurately. Experiments with five commonly used neural network jobs were designed on two types of GPU devices, and the results were analyzed. The results show that the proposed Adaptive Interference Prediction (AIP) mechanism can complete prediction model selection and performance interference prediction quickly, without requiring any prior assumed information, with time consumption within 300 s and prediction error in the range of 2% to 13%, so it can be applied to scenarios such as job scheduling and load balancing.

Key words: distributed machine learning, performance interference, cluster scheduling, resource sharing, interference prediction

CLC number: