Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (2): 518-523. DOI: 10.11772/j.issn.1001-9081.2019091642

• The 7th CCF Conference on Big Data •

Distributed rough set attribute reduction algorithm under Spark

Xiajie ZHANG1, Jinghua ZHU1,2, Yang CHEN1

  1. School of Computer Science and Technology, Heilongjiang University, Harbin, Heilongjiang 150080, China
  2. Key Laboratory of Database and Parallel Computing of Heilongjiang Province, Harbin, Heilongjiang 150080, China
  • Received: 2019-08-30 Revised: 2019-09-26 Accepted: 2019-10-18 Online: 2019-10-31 Published: 2020-02-10
  • Contact: Jinghua ZHU
  • About author: ZHANG Xiajie, born in 1995 in Wenzhou, Zhejiang, M.S. candidate. His research interests include data mining and rough set theory.
    CHEN Yang, born in 1996 in Chongqing, M.S. candidate. His research interests include data mining and discretization.
  • Supported by:
    the General Program of the Natural Science Foundation of Heilongjiang Province (F2018028)

Abstract:

Attribute reduction (feature selection) is an important step in data preprocessing, and most attribute reduction methods use attribute dependence as the criterion for selecting attribute subsets. A Fast Dependence Calculation (FDC) method was designed that computes the dependence degree by directly searching for the objects that belong to the relative positive region, without first computing the relative positive region itself, which makes it markedly faster than traditional methods. In addition, the Whale Optimization Algorithm (WOA) was improved so that it can be applied effectively to rough set attribute reduction. Combining these two methods, a Spark-based distributed rough set attribute reduction algorithm named SP-WOFRST was proposed and compared with another Spark-based rough set attribute reduction algorithm, SP-RST, on two synthetic large data sets. Experimental results show that the proposed SP-WOFRST algorithm outperforms SP-RST in both accuracy and speed.

Key words: rough set, Apache Spark, Whale Optimization Algorithm (WOA), feature selection, attribute reduction
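
As a rough illustration of the quantity the abstract refers to, rough set theory commonly defines the dependence degree of a decision attribute D on a condition attribute subset C as |POS_C(D)| / |U|, where POS_C(D) is the relative positive region and U is the set of all objects. The Python sketch below counts the objects of the positive region in a single pass over the table, without building the equivalence classes as explicit sets first. It is not the paper's FDC method, whose details are not given in this abstract. The function name `dependency_degree`, its parameters, and the toy table are assumptions made for illustration only.

```python
from typing import Hashable, Sequence


def dependency_degree(
    rows: Sequence[Sequence[Hashable]],
    cond_idx: Sequence[int],
    dec_idx: int,
) -> float:
    """Return |POS_C(D)| / |U| for a decision table (hypothetical helper)."""
    if not rows:
        return 0.0
    # Values on C -> [first decision seen, object count, still consistent?]
    classes = {}
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        decision = row[dec_idx]
        entry = classes.get(key)
        if entry is None:
            classes[key] = [decision, 1, True]
        else:
            entry[1] += 1
            if entry[0] != decision:
                # Objects that agree on C but disagree on D are outside POS_C(D).
                entry[2] = False
    positive = sum(count for _, count, ok in classes.values() if ok)
    return positive / len(rows)


# Toy decision table (made-up data): columns are a, b, decision.
table = [
    (0, 0, "no"),
    (0, 0, "no"),
    (0, 1, "yes"),
    (1, 1, "yes"),
    (1, 1, "no"),
]
gamma_ab = dependency_degree(table, cond_idx=[0, 1], dec_idx=2)
gamma_a = dependency_degree(table, cond_idx=[0], dec_idx=2)
```

In this toy table, dropping attribute `b` lowers the dependence degree, which is the kind of signal an attribute reduction search uses when it compares candidate subsets.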

CLC number: