Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (12): 3327-3330.

• Advanced Computing •

Parallel matrix multiplication based on MPI+CUDA asynchronous model

LIU Qing-kun (刘青昆), MA Ming-wei (马名威), YAN Wei-chun (阎慰椿)

  1. College of Computer and Information Technology, Liaoning Normal University, Dalian Liaoning 116081, China
  • Received:2011-06-20 Revised:2011-08-03 Online:2011-12-12 Published:2011-12-01
  • Contact: MA Ming-wei
  • Supported by: National Natural Science Foundation of China

Abstract: Matrix multiplication plays an important role in scientific computing, and different structural models can improve the performance of parallel matrix multiplication. In the existing MPI+CUDA synchronous model, the host must enter a waiting state and cannot continue working until the device has finished its task, which obviously wastes time. To address this problem, a parallel matrix multiplication based on an MPI+CUDA asynchronous model was proposed. The model keeps the host from entering the waiting state and uses CUDA streams to handle data sets that exceed GPU memory. Analysis of the speedup and efficiency of the asynchronous model, together with the experimental results, shows that the method significantly improves parallel efficiency and the speed of large-scale matrix multiplication, making full use of the distributed memory between nodes and the shared memory within a node. It is an effective and feasible parallel strategy.
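
To make the pattern concrete, the following is a minimal sketch (in CUDA C with MPI, not the authors' implementation) of the asynchronous host/device workflow described above: each MPI rank multiplies its strip of matrix A by the full matrix B on the GPU, splitting the strip into chunks that are copied in, computed, and copied back on separate CUDA streams, so the host only enqueues work and blocks at a single final synchronization point. The matrix size, chunk count, naive kernel, and all identifiers are illustrative assumptions.

// A minimal sketch of the MPI+CUDA asynchronous pattern (illustrative only):
// each rank computes C_strip = A_strip * B on the GPU, streaming A in chunks
// so the host never blocks until the final synchronization.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N      2048   // B is N x N; each rank owns N/size rows of A (assumed divisible)
#define CHUNKS 4      // chunks per rank; bounds the device memory used per stream
#define TILE   16

// Naive kernel: C = A * B for a chunk of 'rows' rows.
__global__ void matmul(const float *A, const float *B, float *C, int rows, int n)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[r * n + k] * B[k * n + c];
        C[r * n + c] = acc;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);              // one GPU per rank on a node

    int rows_rank  = N / size;               // rows of A owned by this rank
    int rows_chunk = rows_rank / CHUNKS;     // rows per asynchronous chunk

    // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
    float *hA, *hB, *hC;
    cudaMallocHost((void **)&hA, (size_t)rows_rank * N * sizeof(float));
    cudaMallocHost((void **)&hB, (size_t)N * N * sizeof(float));
    cudaMallocHost((void **)&hC, (size_t)rows_rank * N * sizeof(float));
    for (size_t i = 0; i < (size_t)rows_rank * N; ++i) hA[i] = 1.0f;
    for (size_t i = 0; i < (size_t)N * N; ++i)         hB[i] = 1.0f;

    // B stays resident on the device; A and C only need one chunk per stream.
    float *dB, *dA[CHUNKS], *dC[CHUNKS];
    cudaMalloc((void **)&dB, (size_t)N * N * sizeof(float));
    cudaMemcpy(dB, hB, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream[CHUNKS];
    for (int s = 0; s < CHUNKS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dA[s], (size_t)rows_chunk * N * sizeof(float));
        cudaMalloc((void **)&dC[s], (size_t)rows_chunk * N * sizeof(float));
    }

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (rows_chunk + TILE - 1) / TILE);

    // Asynchronous pipeline: copy-in, compute and copy-out are only enqueued,
    // so the host returns immediately and could overlap MPI communication or
    // CPU work here instead of idling as in the synchronous model.
    for (int s = 0; s < CHUNKS; ++s) {
        size_t off = (size_t)s * rows_chunk * N;
        cudaMemcpyAsync(dA[s], hA + off, (size_t)rows_chunk * N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        matmul<<<grid, block, 0, stream[s]>>>(dA[s], dB, dC[s], rows_chunk, N);
        cudaMemcpyAsync(hC + off, dC[s], (size_t)rows_chunk * N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();                 // single wait point for all streams

    // Gathering the C strips on rank 0 via MPI_Gather is omitted for brevity.
    if (rank == 0)
        printf("C[0][0] = %.0f (expected %d)\n", hC[0], N);

    for (int s = 0; s < CHUNKS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(dA[s]);
        cudaFree(dC[s]);
    }
    cudaFree(dB);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    MPI_Finalize();
    return 0;
}

Because the copies and kernel launches are only enqueued, the host is free between the loop and the final synchronization, for example to post MPI communication, and the chunking bounds the per-stream device footprint when a rank's data exceeds GPU memory, which is the role CUDA streams play in the model described in the abstract.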

Key words: matrix multiplication, parallel computing, hybrid programming, Message Passing Interface (MPI), Compute Unified Device Architecture (CUDA)

CLC number: