MPI big data processing for high performance applications

Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (12): 3496-3499. DOI: 10.11772/j.issn.1001-9081.2018040771

WANG Peng, ZHOU Yan

School of Computer Science and Technology, Southwest Minzu University, Chengdu Sichuan 610225, China

Received: 2018-04-16 Revised: 2018-06-11 Online: 2018-12-10 Published: 2018-12-15
Contact: ZHOU Yan
About the authors: WANG Peng (born 1975), male, from Leshan, Sichuan; professor, Ph.D., CCF member; main research interests: cloud computing, parallel computing, quantum computing. ZHOU Yan (born 1976), male, from Xi'an, Shaanxi; master's student; main research interests: cloud computing, parallel computing, quantum computing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (60702075), the Fundamental Research Funds for the Central Universities of Southwest University for Nationalities (2017NZYQN27), the Guangdong Province Science and Technology Agency 2016 Provincial Special Fund for Science and Technology Development Project (2016B090918062), and the 2016 Major Project of Collaborative Innovation in Production, Teaching and Research of Guangzhou (201604010115).

Abstract

Abstract: To optimize the existing centralized data management model of the Message Passing Interface (MPI) and enhance its capability for big data processing in high performance computing scenarios, an MPI-based Data Storage Plug-in (MPI-DSP) suited to big data processing was designed and developed, drawing on ideas from parallel and distributed systems. Firstly, interface functions were created to achieve the design goal of "migrating computation to storage" with minimal impact on the MPI system. File allocation was separated from computation so that MPI could break through the network transmission bottleneck when reading big data files. Then, the design goals, operation mechanism and implementation strategy were analyzed and described, and the design concept was verified by describing the use of the interface function MPI_Open in an MPI environment. A Wordcount experiment compared the time performance of MPI with the MPI-DSP component against that of the original MPI in data file processing. The results preliminarily validated the feasibility of the "migrating computation to storage" mode for MPI, giving MPI the capability to process big data in high performance application scenarios. Finally, the applicable environment and limitations of MPI-DSP were analyzed, and its application scope was defined.

Key words: Message Passing Interface (MPI), parallel computing, big data, High Performance Computing (HPC), Data Storage Plug-in (DSP)
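
The "migrating computation to storage" idea in the abstract can be illustrated with a small word count sketch. This is not the authors' MPI-DSP implementation and it does not use MPI at all: it is a plain Python simulation, assuming three hypothetical nodes (`node0` to `node2`) that each already hold one text shard on local storage. The function names, the shard contents and the 8-byte-per-count message size estimate are all assumptions made for illustration. The abstract does not give the signature of MPI_Open, so it is not modelled here.

```python
from collections import Counter

# Hypothetical shards: text assumed to be already stored on each node's local disk.
NODE_SHARDS = {
    "node0": "mpi moves data " * 1000,
    "node1": "move compute to data " * 1000,
    "node2": "data stays local " * 1000,
}


def count_words(text):
    """Count whitespace-separated words in one piece of text."""
    return Counter(text.split())


def centralized_wordcount(shards):
    """Baseline: every raw shard is shipped to one process, then counted there."""
    bytes_moved = sum(len(text.encode("utf-8")) for text in shards.values())
    total = count_words(" ".join(shards.values()))
    return total, bytes_moved


def local_wordcount(shards):
    """Computation moves to storage: each node counts its own shard,
    and only the partial counts travel over the network."""
    total = Counter()
    bytes_moved = 0
    for text in shards.values():
        partial = count_words(text)  # runs where the shard is stored
        # Rough message size (an assumption): each word plus an 8-byte count.
        bytes_moved += sum(len(word.encode("utf-8")) + 8 for word in partial)
        total.update(partial)  # merge step; a reduction in a real MPI program
    return total, bytes_moved


if __name__ == "__main__":
    central_total, central_bytes = centralized_wordcount(NODE_SHARDS)
    local_total, local_bytes = local_wordcount(NODE_SHARDS)
    print("same result:", central_total == local_total)
    print("bytes moved (centralized):", central_bytes)
    print("bytes moved (local counting):", local_bytes)
```

Both strategies produce the same counts; the difference is in how much data crosses the network. For large, repetitive shards the partial counts are far smaller than the raw text, which is the effect the abstract attributes to separating file allocation from computation. For very small shards the partial counts can be larger than the text itself, so the benefit depends on the workload.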

CLC number: TP316.4

WANG Peng, ZHOU Yan. MPI big data processing for high performance applications[J]. Journal of Computer Applications, 2018, 38(12): 3496-3499.


