Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3856-3867.DOI: 10.11772/j.issn.1001-9081.2022111720

• Advanced computing • Previous Articles     Next Articles

Band sparse matrix multiplication and efficient GPU implementation

Li LIU1,2, Changbo CHEN1()   

  1. 1.Chongqing Institute of Green and Intelligent Technology,Chinese Academy of Sciences,Chongqing 400714,China
    2.Chongqing School,University of Chinese Academy of Sciences,Chongqing 400714,China
  • Received:2022-11-21 Revised:2023-05-04 Accepted:2023-05-06 Online:2023-12-11 Published:2023-12-10
  • Contact: Changbo CHEN
  • About author:LIU Li, born in 1998, M. S. candidate. Her research interests include high-performance computing, deep learning.
  • Supported by:
    Youth Top-notch Talent Project of Chongqing Talent Plan(2021000263)

带状稀疏矩阵乘法及高效GPU实现

刘丽1,2, 陈长波1()   

  1. 1.中国科学院 重庆绿色智能技术研究院,重庆 400714
    2.中国科学院大学 重庆学院,重庆 400714
  • 通讯作者: 陈长波
  • 作者简介:刘丽(1998—),女,河南信阳人,硕士研究生,CCF会员,主要研究方向:高性能计算、深度学习;

Abstract:

Sparse-dense Matrix Multiplication (SpMM) is widely used in the fields such as scientific computing and deep learning, and it is of great importance to improve its efficiency. For a class of sparse matrices with band feature, a new storage format BRCV (Banded Row Column Value) and an SpMM algorithm based on this format as well as an efficient Graphics Processing Unit (GPU) implementation were proposed. Due to the fact that each sparse band can contain multiple sparse blocks, the proposed format can be seen as a generalization of the block sparse matrix format. Compared with the commonly used CSR (Compressed Sparse Row) format, BRCV format was able to significantly reduce the storage complexity by avoiding redundant storage of column indices in sparse bands. At the same time, the GPU implementation of SpMM based on BRCV format was able to make more efficient use of GPU’s shared memory and improve the computational efficiency of SpMM algorithm by reusing the rows of both sparse and dense matrices. For randomly generated band sparse matrices, experimental results on two different GPU platforms show that BRCV outperforms not only cuBLAS (CUDA Basic Linear Algebra Subroutines), but also cuSPARSE based on CSR and block sparse formats. Specifically, compared with cuSPARSE based on CSR format, BRCV has the maximum speedup ratio of 6.20 and 4.77 respectively. Moreover, the new implementation was applied to accelerate the SpMM operator in Graph Neural Network (GNN). Experimental results on real application datasets show that BRCV outperforms cuBLAS and cuSPARSE based on CSR format, also outperforms cuSPARSE based on block sparse format in most cases. In specific, compared with cuSPARSE based on CSR format, BRCV has the maximum speedup ratio reached 4.47. The above results indicate that BRCV can improve the efficiency of SpMM effectively.

Key words: band sparse matrix, sparse matrix storage format, sparse matrix multiplication, Graphics Processing Unit (GPU), shared memory

摘要:

稀疏-稠密矩阵乘法(SpMM)广泛应用于科学计算和深度学习等领域,提高它的效率具有重要意义。针对具有带状特征的一类稀疏矩阵,提出一种新的存储格式BRCV(Banded Row Column Value)以及基于此格式的SpMM算法和高效图形处理单元(GPU)实现。由于每个稀疏带可以包含多个稀疏块,所提格式可看成块稀疏矩阵格式的推广。相较于常用的CSR(Compressed Sparse Row)格式,BRCV格式通过避免稀疏带中列下标的冗余存储显著降低存储复杂度;同时,基于BRCV格式的SpMM的GPU实现通过同时复用稀疏和稠密矩阵的行更高效地利用GPU的共享内存,提升SpMM算法的计算效率。在两种不同GPU平台上针对随机生成的带状稀疏矩阵的实验结果显示,BRCV的性能不仅优于cuBLAS(CUDA Basic Linear Algebra Subroutines),也优于基于CSR和块稀疏两种不同格式的cuSPARSE。其中,相较于基于CSR格式的cuSPARSE,BRCV的最高加速比分别为6.20和4.77。此外,将新的实现应用于图神经网络(GNN)中的SpMM算子的加速。在实际应用数据集上的测试结果表明,BRCV的性能优于cuBLAS和基于CSR格式的cuSPARSE,且在大多数情况下优于基于块稀疏格式的cuSPARSE。其中,相较于基于CSR格式的cuSPARSE,BRCV的最高加速比为4.47。以上结果表明BRCV可以有效提升SpMM的效率。

关键词: 带状稀疏矩阵, 稀疏存储格式, 稀疏矩阵乘法, 图形处理单元, 共享内存

CLC Number: