计算机应用 ›› 2019, Vol. 39 ›› Issue (1): 66-71.DOI: 10.11772/j.issn.1001-9081.2018071660

• 2018年全国开放式分布与并行计算学术年会(DPCS 2018)论文 • 上一篇    下一篇

大规模生物网络马尔可夫聚类的并行化算法

孙佳敏, 朱嘉富, 杨伏长, 谢江   

  1. 上海大学 计算机工程与科学学院, 上海 200444
  • 收稿日期:2018-07-19 修回日期:2018-08-17 出版日期:2019-01-10 发布日期:2019-01-21
  • 通讯作者: 谢江
  • 作者简介:孙佳敏(1995-),女,江苏兴化人,硕士研究生,CCF会员,主要研究方向:生物信息学;朱嘉富(1997-),男,湖南衡阳人,主要研究方向:生物信息学;杨伏长(1994-),男,福建宁德人,硕士研究生,CCF会员,主要研究方向:生物信息学;谢江(1971-),女,湖北恩施人,副教授,博士,CCF高级会员,主要研究方向:生物信息学、高性能计算。
  • 基金资助:
    国家重点研发计划重点专项(2016YFC1401900);上海市自然科学基金资助项目(17ZR1409900)。

Parallel algorithm of Markov clustering for large-scale biological networks

SUN Jiamin, ZHU Jiafu, YANG Fuzhang, XIE Jiang   

  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Received:2018-07-19 Revised:2018-08-17 Online:2019-01-10 Published:2019-01-21
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2016YFC1401900), the Natural Science Foundation of Shanghai (17ZR1409900).

摘要: 马尔可夫聚类算法(MCL)是在大规模生物网络中寻找模块的一个有效方法,能够挖掘网络结构和功能影响力较大的模块。算法涉及到大规模矩阵计算,因此复杂度可达立方阶次。针对复杂度高的问题,提出了基于消息传递接口(MPI)的并行化马尔可夫聚类算法以提高算法的计算性能。首先,生物网络转化成邻接矩阵;然后,根据算法的特性,按照矩阵的规模判断并重新生成新矩阵以处理非平方倍数矩阵的计算;其次,并行计算通过按块分配的方式能够有效地实现任意规模矩阵的运算;最后,循环并行计算直至收敛,得到网络聚类结果。通过模拟网络和真实生物网络数据集的实验结果表明,与全块集体式通信(FCC)并行方法相比,平均并行效率提升了10个百分点以上,因此可以将该优化算法应用在不同类型的大规模生物网络中。

关键词: 消息传递接口, 并行化, 马尔可夫聚类, Cannon算法, 大规模生物网络

Abstract: Markov Clustering Algorithm (MCL) is an effective method to find modules in large-scale biological networks. It can mine modules that have significant influence on network structure and function. The algorithm involves large-scale matrix calculations, so its complexity can reach cubic orders. For the problem of high complexity, a parallel algorithm of Markov clustering based on Message Passing Interface (MPI) was proposed to improve computational performance of algorithm. Firstly, a biological network was transformed into an adjacency matrix. Secondly, according to the characteristics of the algorithm, the matrix size was judged and a new matrix was regenerated to handle the calculation of non-square multiple matrix. Thirdly, the algorithm was calculated in parallel by means of block allocation, which could effectively implement the operation of matrix of any size. Finally, the loop was parallelized until the matrix was converged to obtain network clustering results. The experimental results on simulated network and real biological network datasets show that compared with Full-block Collective Communication (FCC) parallel method, the average parallel efficiency is improved by more than 10 percentage points, so the optimization algorithm can be applied in different types of large-scale biological networks.

Key words: Message Passing Interface (MPI), parallelization, Markov clustering, Cannon algorithm, large-scale biological network

中图分类号: