Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (12): 3327-3330.

• Advanced Computing •

Parallel matrix multiplication based on MPI+CUDA asynchronous model

LIU Qing-kun (刘青昆), MA Ming-wei (马名威), YAN Wei-chun (阎慰椿)

  1. College of Computer and Information Technology, Liaoning Normal University, Dalian Liaoning 116081, China
  • Received:2011-06-20 Revised:2011-08-03 Online:2011-12-12 Published:2011-12-01
  • Contact: MA Ming-wei
  • Supported by: National Natural Science Foundation of China

Abstract: Matrix multiplication plays an important role in scientific computing, and different structural models can improve the performance of parallel matrix multiplication. In the existing MPI+CUDA synchronous model, the host must enter a waiting state and cannot continue working until the device has finished its task, which obviously wastes time. To address this problem, a parallel matrix multiplication based on an MPI+CUDA asynchronous model was proposed. The model keeps the host from entering the waiting state and uses CUDA streams to handle data sets that exceed GPU memory. Analysis of the speedup and efficiency of the asynchronous model, together with the experimental results, shows that the method significantly improves parallel efficiency and the speed of large-scale matrix multiplication, making full use of the distributed memory between nodes and the shared memory within a node. It is an effective and feasible parallel strategy.
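
To make the pattern concrete, the following is a minimal sketch (in CUDA C with MPI, not the authors' implementation) of the asynchronous host/device workflow described above: each MPI rank multiplies its strip of matrix A by the full matrix B on the GPU, splitting the strip into chunks that are copied in, computed, and copied back on separate CUDA streams, so the host only enqueues work and blocks at a single final synchronization point. The matrix size, chunk count, naive kernel, and all identifiers are illustrative assumptions.

// A minimal sketch of the MPI+CUDA asynchronous pattern (illustrative only):
// each rank computes C_strip = A_strip * B on the GPU, streaming A in chunks
// so the host never blocks until the final synchronization.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N      2048   // B is N x N; each rank owns N/size rows of A (assumed divisible)
#define CHUNKS 4      // chunks per rank; bounds the device memory used per stream
#define TILE   16

// Naive kernel: C = A * B for a chunk of 'rows' rows.
__global__ void matmul(const float *A, const float *B, float *C, int rows, int n)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[r * n + k] * B[k * n + c];
        C[r * n + c] = acc;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);              // one GPU per rank on a node

    int rows_rank  = N / size;               // rows of A owned by this rank
    int rows_chunk = rows_rank / CHUNKS;     // rows per asynchronous chunk

    // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
    float *hA, *hB, *hC;
    cudaMallocHost((void **)&hA, (size_t)rows_rank * N * sizeof(float));
    cudaMallocHost((void **)&hB, (size_t)N * N * sizeof(float));
    cudaMallocHost((void **)&hC, (size_t)rows_rank * N * sizeof(float));
    for (size_t i = 0; i < (size_t)rows_rank * N; ++i) hA[i] = 1.0f;
    for (size_t i = 0; i < (size_t)N * N; ++i)         hB[i] = 1.0f;

    // B stays resident on the device; A and C only need one chunk per stream.
    float *dB, *dA[CHUNKS], *dC[CHUNKS];
    cudaMalloc((void **)&dB, (size_t)N * N * sizeof(float));
    cudaMemcpy(dB, hB, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream[CHUNKS];
    for (int s = 0; s < CHUNKS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dA[s], (size_t)rows_chunk * N * sizeof(float));
        cudaMalloc((void **)&dC[s], (size_t)rows_chunk * N * sizeof(float));
    }

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (rows_chunk + TILE - 1) / TILE);

    // Asynchronous pipeline: copy-in, compute and copy-out are only enqueued,
    // so the host returns immediately and could overlap MPI communication or
    // CPU work here instead of idling as in the synchronous model.
    for (int s = 0; s < CHUNKS; ++s) {
        size_t off = (size_t)s * rows_chunk * N;
        cudaMemcpyAsync(dA[s], hA + off, (size_t)rows_chunk * N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        matmul<<<grid, block, 0, stream[s]>>>(dA[s], dB, dC[s], rows_chunk, N);
        cudaMemcpyAsync(hC + off, dC[s], (size_t)rows_chunk * N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();                 // single wait point for all streams

    // Gathering the C strips on rank 0 via MPI_Gather is omitted for brevity.
    if (rank == 0)
        printf("C[0][0] = %.0f (expected %d)\n", hC[0], N);

    for (int s = 0; s < CHUNKS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(dA[s]);
        cudaFree(dC[s]);
    }
    cudaFree(dB);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    MPI_Finalize();
    return 0;
}

Because the copies and kernel launches are only enqueued, the host is free between the loop and the final synchronization, for example to post MPI communication, and the chunking bounds the per-stream device footprint when a rank's data exceeds GPU memory, which is the role CUDA streams play in the model described in the abstract.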

Key words: matrix multiplication, parallel computing, hybrid programming, Message Passing Interface (MPI), Compute Unified Device Architecture (CUDA)

CLC number: