Parallel algorithm of Markov clustering for large-scale biological networks

doi:10.11772/j.issn.1001-9081.2018071660

Journal of Computer Applications, 2019, Vol. 39, Issue (1): 66-71. DOI: 10.11772/j.issn.1001-9081.2018071660

• Papers of the 2018 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2018) •

SUN Jiamin, ZHU Jiafu, YANG Fuzhang, XIE Jiang

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Received: 2018-07-19 Revised: 2018-08-17 Online: 2019-01-21 Published: 2019-01-10
Corresponding author: XIE Jiang
About the authors: SUN Jiamin (born 1995), female, from Xinghua, Jiangsu, master's student, CCF member; main research interest: bioinformatics. ZHU Jiafu (born 1997), male, from Hengyang, Hunan; main research interest: bioinformatics. YANG Fuzhang (born 1994), male, from Ningde, Fujian, master's student, CCF member; main research interest: bioinformatics. XIE Jiang (born 1971), female, from Enshi, Hubei, associate professor, Ph.D., CCF senior member; main research interests: bioinformatics and high-performance computing.
Supported by:
This work is partially supported by the National Key Research and Development Program of China (2016YFC1401900) and the Natural Science Foundation of Shanghai (17ZR1409900).

Abstract

Abstract: The Markov Clustering algorithm (MCL) is an effective method for finding modules in large-scale biological networks; it can mine modules that strongly influence network structure and function. Because the algorithm involves large-scale matrix computation, its complexity can reach cubic order. To address this high complexity, a parallel Markov clustering algorithm based on the Message Passing Interface (MPI) was proposed to improve computational performance. Firstly, the biological network was transformed into an adjacency matrix. Secondly, according to the characteristics of the algorithm, the matrix size was checked and, where needed, a new matrix was generated so that matrices whose size is not a square multiple could be handled. Thirdly, the computation was parallelized by block allocation, which effectively supports matrices of any size. Finally, the parallel computation was iterated until convergence to obtain the network clustering result. Experimental results on simulated networks and real biological network datasets show that, compared with the Full-block Collective Communication (FCC) parallel method, the average parallel efficiency improves by more than 10 percentage points, so the optimized algorithm can be applied to different types of large-scale biological networks.

Key words: Message Passing Interface (MPI), parallelization, Markov clustering, Cannon algorithm, large-scale biological network
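The four steps in the abstract can be illustrated with a small serial sketch. It is not the authors' MPI implementation: the q x q process grid of the Cannon algorithm is only simulated in a single process, zero padding is an assumed stand-in for the "regenerated" matrix, and the added self-loops, the inflation value 2.0, and all function names (`pad_to_multiple`, `cannon_matmul`, `mcl`) are hypothetical choices made for illustration.

```python
def matmul(a, b):
    """Plain dense product of two equally sized square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def normalize_columns(m):
    """Rescale every column to sum to 1; all-zero columns stay zero."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def pad_to_multiple(m, q):
    """Zero-pad a square matrix so that its order is a multiple of q.

    Assumption: zero padding stands in for the paper's regenerated matrix.
    """
    n = len(m)
    size = -(-n // q) * q  # ceil(n / q) * q
    return [[float(m[i][j]) if i < n and j < n else 0.0 for j in range(size)]
            for i in range(size)]


def cannon_matmul(a, b, q):
    """Multiply a and b block-wise in the order Cannon's algorithm uses.

    The q x q process grid is simulated serially; no MPI is involved.
    """
    bs = len(a) // q  # block size; len(a) must be a multiple of q

    def block(m, r, c):
        return [row[c * bs:(c + 1) * bs] for row in m[r * bs:(r + 1) * bs]]

    # Initial alignment: cell (r, c) holds A[r][(r+c)%q] and B[(r+c)%q][c].
    A = [[block(a, r, (r + c) % q) for c in range(q)] for r in range(q)]
    B = [[block(b, (r + c) % q, c) for c in range(q)] for r in range(q)]
    C = [[[[0.0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for r in range(q):
            for c in range(q):
                prod = matmul(A[r][c], B[r][c])
                C[r][c] = [[x + y for x, y in zip(cr, pr)]
                           for cr, pr in zip(C[r][c], prod)]
        # Shift A blocks one step left and B blocks one step up (cyclically).
        A = [[A[r][(c + 1) % q] for c in range(q)] for r in range(q)]
        B = [[B[(r + 1) % q][c] for c in range(q)] for r in range(q)]
    n = bs * q
    return [[C[i // bs][j // bs][i % bs][j % bs] for j in range(n)]
            for i in range(n)]


def mcl(adj, q=2, inflation=2.0, max_iter=100, tol=1e-9):
    """Serial MCL sketch: expansion (M * M) and inflation until convergence."""
    n = len(adj)
    # Self-loops are a common MCL convention, assumed here.
    m = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(pad_to_multiple(m, q))
    for _ in range(max_iter):
        expanded = cannon_matmul(m, m, q)
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(x - y) for ri, rm in zip(inflated, m)
                    for x, y in zip(ri, rm))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row of the limit matrix lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Two disconnected triangles; q=4 forces padding from order 6 to order 8.
adj = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
print(mcl(adj, q=4))  # [[0, 1, 2], [3, 4, 5]]
```

In an MPI setting, each (r, c) cell of the simulated grid would correspond to one process holding one block of each operand, and the two cyclic shifts would become neighbour-to-neighbour messages; that mapping is the usual reading of the Cannon algorithm rather than a detail confirmed by the abstract.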

CLC Number:

TP301.6

SUN Jiamin, ZHU Jiafu, YANG Fuzhang, XIE Jiang. Parallel algorithm of Markov clustering for large-scale biological networks[J]. Journal of Computer Applications, 2019, 39(1): 66-71.
