Distributed rough set attribute reduction algorithm under Spark

doi:10.11772/j.issn.1001-9081.2019091642

Abstract

Attribute reduction (feature selection) is an important part of data preprocessing, and most attribute reduction methods use attribute dependence as the criterion for selecting attribute subsets. A Fast Dependence Calculation (FDC) method was designed that calculates the dependence by directly searching for the objects that belong to the relative positive region, without computing the relative positive region in advance, which makes it significantly faster than traditional methods. In addition, the Whale Optimization Algorithm (WOA) was improved so that it can be applied effectively to rough set attribute reduction. Combining the two methods, a Spark-based distributed rough set attribute reduction algorithm named SP-WOFRST was proposed and compared with another Spark-based rough set attribute reduction algorithm, SP-RST, on two large synthetic data sets. Experimental results show that SP-WOFRST outperforms SP-RST in both accuracy and speed.

Key words: rough set, Apache Spark, Whale Optimization Algorithm (WOA), feature selection, attribute reduction

CLC Number: TP391

Xiajie ZHANG, Jinghua ZHU, Yang CHEN. Distributed rough set attribute reduction algorithm under Spark[J]. Journal of Computer Applications, 2020, 40(2): 518-523.

Individual ID	Skin color	Eye color	Birthplace
X1	White	Blue	Europe
X2	White	Blue	Europe
X3	Yellow	Brown	Asia
X4	Black	Dark brown	Africa
X5	Yellow	Brown	Asia
X6	Black	Dark brown	Africa
X7	Black	Dark brown	America
X8	Black	Dark brown	America

No.	Condition attributes		Decision class
1	White	Blue	1
2	White	Blue	1
3	Yellow	Brown	2
4	Black	Dark brown	3
5	Yellow	Brown	2
6	Black	Dark brown	3
7	Black	Dark brown	4
8	Black	Dark brown	4

No.	Condition attributes		Decision class	Flag
1	White	Blue	1	0
2	White	Blue	1	0
3	Yellow	Brown	2	0
4	Black	Dark brown	3	1
5	Yellow	Brown	2	0
6	Black	Dark brown	3	1
7	Black	Dark brown	4	1
8	Black	Dark brown	4	1
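
The last column of the table above appears to flag objects whose condition values occur with more than one decision class; this reading is an assumption, since the table carries no caption here. Under that reading, a minimal Python sketch can reproduce the flags and the textbook rough set dependence, that is, the fraction of objects whose condition values determine a single decision class. This is not the paper's FDC method or its Spark implementation, which are not described on this page, and the function names `inconsistency_flags` and `dependency` are hypothetical.

```python
from collections import defaultdict

# Decision table copied from the rows above:
# (object number, (skin color, eye color), decision class)
ROWS = [
    (1, ("White", "Blue"), 1),
    (2, ("White", "Blue"), 1),
    (3, ("Yellow", "Brown"), 2),
    (4, ("Black", "Dark brown"), 3),
    (5, ("Yellow", "Brown"), 2),
    (6, ("Black", "Dark brown"), 3),
    (7, ("Black", "Dark brown"), 4),
    (8, ("Black", "Dark brown"), 4),
]


def inconsistency_flags(rows):
    """Map each object number to 1 if its condition values occur with
    more than one decision class, else 0 (assumed meaning of the flag)."""
    decisions_seen = defaultdict(set)
    for _, condition, decision in rows:
        decisions_seen[condition].add(decision)
    return {
        number: int(len(decisions_seen[condition]) > 1)
        for number, condition, _ in rows
    }


def dependency(rows):
    """Standard dependence: share of objects in the positive region,
    i.e. objects whose condition values fix exactly one decision class."""
    flags = inconsistency_flags(rows)
    consistent = sum(1 for flag in flags.values() if flag == 0)
    return consistent / len(rows)


flags = inconsistency_flags(ROWS)
gamma = dependency(ROWS)
print(flags)
print(gamma)
```

On this table, objects 4, 6, 7 and 8 share the condition values (Black, Dark brown) but fall into decision classes 3 and 4, so they are flagged, and the dependence is 4/8 = 0.5.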