《计算机应用》唯一官方网站 ›› 2020, Vol. 40 ›› Issue (2): 311-315.DOI: 10.11772/j.issn.1001-9081.2019091640

• 2019年全国开放式分布与并行计算学术年会(DPCS 2019)论文 •    下一篇

面向期限感知分布式矩阵相乘的高效存储方案

赵永柱1, 黎卫东2(), 唐斌2, 梅峰3, 卢文达3   

  1. 1. 国网陕西省电力公司信息通信公司,西安 710065
    2. 计算机软件新技术国家重点实验室(南京大学),南京 210023
    3. 国网浙江省电力有限公司,杭州 310007
  • 收稿日期:2019-07-31 修回日期:2019-10-14 接受日期:2019-10-16 发布日期:2020-02-29 出版日期:2021-12-02
  • 通讯作者: 黎卫东
  • 作者简介:赵永柱(1974—),男,陕西乾县人,高级工程师,硕士,主要研究方向:电力信息系统、数据中心
    唐斌(1986—),男,江苏东台人,助理研究员,博士,CCF会员,主要研究方向:分布式计算、编码理论
    梅峰(1977—),男,浙江湖州人,高级工程师,硕士,主要研究方向:电力信息系统、大数据
    卢文达(1989—),男,吉林松原人,助理工程师,硕士,主要研究方向:数据挖掘、云计算。
  • 基金资助:
    国家自然科学基金资助项目(61832005);国家电网公司科技项目(52110418001M)

Efficient storage scheme for deadline aware distributed matrix multiplication

Yongzhu ZHAO1, Weidong LI2(), Bin TANG2, Feng MEI3, Wenda LU3   

  1. 1. State Grid Shaanxi Electric Power Information & Telecommunication Company Limited,Xi’an Shaanxi 710065,China
    2. State Key Laboratory for Novel Software Technology (Nanjing University),Nanjing Jiangsu 210023,China
    3. State Grid Zhejiang Electric Power Company Limited,Hangzhou Zhejiang 310007,China
  • Received:2019-07-31 Revised:2019-10-14 Accepted:2019-10-16 Online:2020-02-29 Published:2021-12-02
  • Contact: Weidong LI
  • About author:ZHAO Yongzhu, born in 1974, M. S., senior engineer. His research interests include power information system, data center.
    TANG Bin, born in 1986, Ph. D., assistant research fellow. His research interests include distributed computing, coding theory.
    MEI Feng, born in 1977, M. S., senior engineer. His research interests include power information system, big data.
    LU Wenda, born in 1989, M. S., assistant engineer. His research interests include data mining, cloud computing.
  • Supported by:
    the National Natural Science Foundation of China(61832005);the Science and Technology Project of State Grid Corporation of China(52110418001M)

摘要:

分布式矩阵相乘是众多分布式机器学习、科学计算等应用中的关键操作,但其性能会受到系统中常见的落后节点的严重影响。最近研究者提出了基于喷泉码的编码矩阵相乘方法,能够充分利用落后节点的部分计算结果,从而大幅度减轻落后节点问题,但忽略了工作节点的存储开销。在考虑存储开销与计算完成时间之间的权衡关系的基础上,首先提出了面向异构工作节点的计算期限感知的存储优化问题;然后进一步通过理论分析,提出了基于期望近似的解决思路,并通过松弛将问题转化为凸优化问题以方便高效求解。仿真实验表明,在保证较大的任务成功率的情况下,所提方案的存储开销会随着任务期限的放宽迅速下降,并且该方案能够更大幅度降低编码带来的存储开销。也就是说,所提方案能够在保障整体计算在期限内大概率完成的前提下,大幅度降低总体的额外存储负载。

关键词: 分布式计算, 矩阵相乘, 落后节点, 凸优化, 期限感知

Abstract:

Distributed matrix multiplication is a fundamental operation in many distributed machine learning and scientific computing applications, but its performance is greatly influenced by the stragglers commonly existed in the systems. Recently, researchers have proposed a fountain code based coded matrix multiplication method, which can effectively mitigate the effect of stragglers by fully exploiting the partial results of stragglers. However, it lacks the consideration of the storage cost of worker nodes. By considering the tradeoff relationship between the storage cost and the finish time of computation, the computational deadline-aware storage optimization problem for heterogeneous worker nodes was proposed firstly. Then, through the theoretical analysis, the solution based on expectation approximation was presented, and the problem was transformed into a convex optimization problem by relaxation for efficient solution. Simulation results show that in the case of ensuring a large task success rate, the storage overhead of the proposed scheme will rapidly decrease as the task duration is relaxed, and the scheme can greatly reduce the storage overhead brought by encoding. In other words, the proposed scheme can significantly reduce the extra storage overhead while guaranteeing that the whole computation can be finished before the deadline with high probability.

Key words: distributed computation, matrix multiplication, straggler, convex optimization, deadline aware

中图分类号: