Distributed rough set attribute reduction algorithm under Spark
ZHANG Xiajie1, ZHU Jinghua1,2, CHEN Yang1
1. School of Computer Science and Technology, Heilongjiang University, Harbin, Heilongjiang 150080, China; 2. Key Laboratory of Database and Parallel Computing of Heilongjiang Province, Harbin, Heilongjiang 150080, China
Abstract: Attribute reduction (feature selection) is an important part of data preprocessing. Most attribute reduction methods use attribute dependency as the criterion for evaluating candidate attribute subsets. A Fast Dependence Calculation (FDC) method was designed to compute dependency by searching directly for the objects that belong to the relative positive region, without constructing the relative positive region in advance, which gives the method a significant speed advantage over traditional approaches. In addition, the Whale Optimization Algorithm (WOA) was improved so that it applies effectively to rough set attribute reduction. Combining these two methods, a Spark-based distributed rough set attribute reduction algorithm named SP-WOFRST was proposed and compared with SP-RST, an existing Spark-based rough set attribute reduction algorithm, on two large synthetic data sets. Experimental results show that SP-WOFRST outperforms SP-RST in both accuracy and speed.
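For readers unfamiliar with the dependency criterion the abstract refers to, the sketch below shows the standard rough-set definition: the dependency of decision attribute d on condition attributes C is the fraction of objects in the positive region, i.e. objects whose C-equivalence class is consistent on d. This is a minimal baseline illustration, not the paper's FDC algorithm or its Spark implementation; the function and attribute names are illustrative assumptions.

```python
from collections import defaultdict

def dependency(table, cond_attrs, dec_attr):
    """Rough-set dependency gamma_C(d): the share of objects whose
    equivalence class under cond_attrs maps to a single decision value.
    (Baseline definition; the paper's FDC method computes this faster.)"""
    # Partition objects into equivalence classes by their condition values.
    classes = defaultdict(list)
    for row in table:
        key = tuple(row[a] for a in cond_attrs)
        classes[key].append(row[dec_attr])
    # A class lies in the positive region iff all members agree on d.
    pos_size = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return pos_size / len(table)

# Tiny illustrative decision table: rows 0 and 1 conflict on {a, b},
# so only rows 2 and 3 fall in the positive region (gamma = 0.5).
table = [
    {"a": 0, "b": 0, "d": 0},
    {"a": 0, "b": 0, "d": 1},
    {"a": 1, "b": 0, "d": 1},
    {"a": 1, "b": 1, "d": 0},
]
print(dependency(table, ["a", "b"], "d"))  # -> 0.5
```

An attribute-reduction search (such as the paper's improved WOA) would use a function like this as the fitness of each candidate attribute subset, seeking the smallest subset whose dependency matches that of the full attribute set.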