Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (1): 57-60. DOI: 10.11772/j.issn.1001-9081.2018071869

• Paper from the 2018 National Annual Academic Conference on Open Distributed and Parallel Computing (DPCS 2018) •

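Multi-source data parallel preprocessing method based on similar connection

GUO Fangfang, CHAO Luomeng, ZHU Jianwen

  1. School of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China
  • Received: 2018-07-19 Revised: 2018-09-25 Online: 2019-01-10 Published: 2019-01-21
  • Corresponding author: CHAO Luomeng
  • About the authors: GUO Fangfang (born 1974), male, from Harbin, Heilongjiang, associate professor, Ph.D.; main research interests: cloud computing, network security, P2P networks. CHAO Luomeng (born 1994), male (Mongolian ethnicity), from Tongliao, Inner Mongolia, master's student; main research interests: network security, machine learning, data analysis. ZHU Jianwen (born 1989), male, from Chengdu, Sichuan, master's student; main research interests: network security, data analysis, security situation awareness.
  • Supported by:
    the National Science and Technology Major Project (2016ZX03001023-005), the National Industry-University-Research Cooperation Project (2016ZTE01-03-06), and the Fundamental Research Funds for the Central Universities (HEUCF100601).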

Abstract: The development of large-scale network environments and big data technologies poses new challenges to traditional data fusion analysis. To address the poor flexibility and low processing efficiency of current multi-source data fusion analysis, a multi-source data parallel preprocessing method based on similar connection was proposed, built on the ideas of divide-and-conquer and parallelism. Firstly, flexibility was improved by a preprocessing method that unifies the similar semantics in multi-source data while retaining the source-specific semantics. Secondly, an improved parallel MapReduce framework was proposed to increase the efficiency of similar connection. Experimental results show that the proposed method reduces the total data volume by 32% while preserving data integrity. Compared with the traditional MapReduce framework, the improved framework reduces time consumption by 43.91%; the proposed method can therefore effectively improve the efficiency of multi-source data fusion analysis.

Key words: network security, multi-source data, data preprocessing, similar connection, MapReduce
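
The abstract does not give the details of the improved framework, so the following is only a minimal single-process sketch of the general idea behind a MapReduce-style similar connection (more commonly called a similarity join). It is not the authors' method. The record layout, the source prefixes `ids:` and `fw:`, the Jaccard measure, and the 0.5 threshold are all assumptions made for illustration. The map step emits one (token, record id) pair per token, the shuffle step groups record ids by token, and the reduce step verifies only those pairs that share at least one token.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records from two sources (an IDS and a firewall); the field
# values are invented for illustration and do not come from the paper.
RECORDS = {
    "ids:1": ["tcp", "10.0.0.5", "port_scan", "high"],
    "fw:7": ["tcp", "10.0.0.5", "port_scan", "blocked"],
    "ids:2": ["udp", "10.0.0.9", "dns_flood", "low"],
}


def jaccard(a, b):
    """Jaccard similarity of two token lists (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def map_phase(records):
    """Map: emit (token, record_id) so records sharing a token meet later."""
    for rid, tokens in records.items():
        for tok in set(tokens):
            yield tok, rid


def shuffle(pairs):
    """Shuffle: group record ids by token, as the framework would by key."""
    groups = defaultdict(set)
    for key, rid in pairs:
        groups[key].add(rid)
    return groups


def reduce_phase(groups, records, threshold):
    """Reduce: verify each candidate pair once and keep the similar ones."""
    seen = set()
    results = []
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) in seen:
                continue  # pair already verified under another token
            seen.add((a, b))
            sim = jaccard(records[a], records[b])
            if sim >= threshold:
                results.append((a, b, sim))
    return sorted(results)


def similarity_join(records, threshold=0.5):
    """Run map, shuffle, and reduce in sequence inside one process."""
    return reduce_phase(shuffle(map_phase(records)), records, threshold)


if __name__ == "__main__":
    print(similarity_join(RECORDS, threshold=0.5))
```

With the sample data, only the two port-scan records share tokens, so they are the only candidate pair that is verified; their Jaccard similarity is 3/5. In a real cluster the three phases would run across many workers, and that is the part an improved framework would aim to speed up.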
