Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (2): 518-523.DOI: 10.11772/j.issn.1001-9081.2019091642
• CCF Bigdata 2019 •
Xiajie ZHANG, Jinghua ZHU, Yang CHEN
Received: 2019-08-30
Revised: 2019-09-26
Accepted: 2019-10-18
Online: 2019-10-31
Published: 2020-02-10
Contact: Jinghua ZHU
About author: ZHANG Xiajie, born in 1995, a native of Wenzhou, Zhejiang, M. S. candidate. His research interests include data mining and rough set theory.
Xiajie ZHANG, Jinghua ZHU, Yang CHEN. Distributed rough set attribute reduction algorithm under Spark[J]. Journal of Computer Applications, 2020, 40(2): 518-523.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2019091642
Individual ID | Skin color | Eye color | Birthplace |
---|---|---|---|
X1 | White | Blue | Europe |
X2 | White | Blue | Europe |
X3 | Yellow | Brown | Asia |
X4 | Black | Dark brown | Africa |
X5 | Yellow | Brown | Asia |
X6 | Black | Dark brown | Africa |
X7 | Black | Dark brown | America |
X8 | Black | Dark brown | America |
Tab. 1 Decision table
No. | Skin color (condition attribute) | Eye color (condition attribute) | Decision class | Flag |
---|---|---|---|---|
1 | White | Blue | 1 | 0 |
2 | White | Blue | 1 | 0 |
3 | Yellow | Brown | 2 | 0 |
4 | Black | Dark brown | 3 | 0 |
5 | Yellow | Brown | 2 | 0 |
6 | Black | Dark brown | 3 | 0 |
7 | Black | Dark brown | 4 | 0 |
8 | Black | Dark brown | 4 | 0 |
Tab. 2 Results after preprocessing for Tab. 1
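
The preprocessing behind Tab. 2 can be read as two small operations: each distinct birthplace in Tab. 1 is replaced by an integer decision class, and every object receives an initial flag of 0. The Python sketch below reproduces the table under that reading. The function name `preprocess` and the numbering of classes in order of first appearance are assumptions made for illustration, not details stated by the authors.

```python
# Rows of Tab. 1: (skin color, eye color, birthplace).
TABLE_1 = [
    ("White", "Blue", "Europe"),
    ("White", "Blue", "Europe"),
    ("Yellow", "Brown", "Asia"),
    ("Black", "Dark brown", "Africa"),
    ("Yellow", "Brown", "Asia"),
    ("Black", "Dark brown", "Africa"),
    ("Black", "Dark brown", "America"),
    ("Black", "Dark brown", "America"),
]


def preprocess(rows):
    """Return (number, condition attributes, decision class, flag) tuples.

    Assumption: decision values are numbered 1, 2, ... in order of first
    appearance, and every flag starts at 0.
    """
    class_ids = {}
    result = []
    for number, (*conditions, decision) in enumerate(rows, start=1):
        # Reuse the ID of a decision value already seen, else assign the next one.
        class_id = class_ids.setdefault(decision, len(class_ids) + 1)
        result.append((number, tuple(conditions), class_id, 0))
    return result


for row in preprocess(TABLE_1):
    print(row)
```

Each printed tuple corresponds to one row of Tab. 2, with the two condition attributes grouped together.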
No. | Skin color (condition attribute) | Eye color (condition attribute) | Decision class | Flag |
---|---|---|---|---|
1 | White | Blue | 1 | 0 |
2 | White | Blue | 1 | 0 |
3 | Yellow | Brown | 2 | 0 |
4 | Black | Dark brown | 3 | 1 |
5 | Yellow | Brown | 2 | 0 |
6 | Black | Dark brown | 3 | 1 |
7 | Black | Dark brown | 4 | 1 |
8 | Black | Dark brown | 4 | 1 |
Tab. 3 Results after step 2
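
Tab. 3 differs from Tab. 2 only in the flag column. One plausible reading, assumed here rather than confirmed by the source, is that step 2 sets the flag to 1 for every object whose condition attributes coincide with those of an object in a different decision class, that is, for the inconsistent objects of the decision table. The sketch below applies that rule to the rows of Tab. 2; `flag_inconsistent` is a hypothetical helper name.

```python
from collections import defaultdict

# Rows of Tab. 2: (number, (skin color, eye color), decision class, flag).
TABLE_2 = [
    (1, ("White", "Blue"), 1, 0),
    (2, ("White", "Blue"), 1, 0),
    (3, ("Yellow", "Brown"), 2, 0),
    (4, ("Black", "Dark brown"), 3, 0),
    (5, ("Yellow", "Brown"), 2, 0),
    (6, ("Black", "Dark brown"), 3, 0),
    (7, ("Black", "Dark brown"), 4, 0),
    (8, ("Black", "Dark brown"), 4, 0),
]


def flag_inconsistent(rows):
    """Set flag to 1 for rows whose condition values map to several classes."""
    classes_by_condition = defaultdict(set)
    for _, conditions, decision_class, _ in rows:
        classes_by_condition[conditions].add(decision_class)
    return [
        (number, conditions, decision_class,
         1 if len(classes_by_condition[conditions]) > 1 else 0)
        for number, conditions, decision_class, _ in rows
    ]


for row in flag_inconsistent(TABLE_2):
    print(row)
```

Under this assumption, objects 4, 6, 7 and 8 are flagged because the pair (Black, Dark brown) occurs with both decision class 3 and decision class 4, which matches the flag column of Tab. 3.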
Dataset | Instances | Attributes | Type | Classes |
---|---|---|---|---|
arsds1 | 3 000 | 500 | Integer | 25 |
arsds2 | 3 000 | 500 | Integer | 50 |
Tab. 4 Experimental datasets
Features per partition | Total iterations | Reduct size | TP | FN | FP | TN | Precision | Recall | F1 | Time/s |
---|---|---|---|---|---|---|---|---|---|---|
2 | 180 | 89 | 46 | 43 | 4 | 407 | 0.92 | 0.517 | 0.662 | 198 |
2 | 200 | 71 | 43 | 28 | 7 | 422 | 0.86 | 0.606 | 0.711 | 222 |
4 | 80 | 65 | 39 | 26 | 11 | 424 | 0.78 | 0.600 | 0.678 | 133 |
4 | 90 | 49 | 33 | 16 | 17 | 434 | 0.66 | 0.673 | 0.667 | 150 |
6 | 40 | 75 | 40 | 35 | 10 | 415 | 0.80 | 0.533 | 0.640 | 106 |
6 | 50 | 46 | 31 | 15 | 19 | 435 | 0.62 | 0.674 | 0.646 | 133 |
8 | 25 | 63 | 39 | 24 | 11 | 426 | 0.78 | 0.619 | 0.690 | 341 |
8 | 30 | 59 | 35 | 24 | 15 | 426 | 0.70 | 0.593 | 0.642 | 412 |
10 | 20 | 99 | 46 | 53 | 4 | 397 | 0.92 | 0.465 | 0.617 | 1 133 |
10 | 30 | 41 | 31 | 10 | 19 | 440 | 0.62 | 0.756 | 0.681 | 1 700 |
Tab. 5 Performance of SP-RST on arsds1 dataset with different parameter settings
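
The precision, recall and F1 columns of Tab. 5 can be recomputed from the TP, FN and FP counts exactly as the table labels them. The short Python sketch below does this for the first row; the function name `precision_recall_f1` is hypothetical, and the formulas are the usual definitions, which are assumed (not stated in the source) to be the ones the authors used.

```python
def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# First row of Tab. 5: TP = 46, FN = 43, FP = 4.
p, r, f1 = precision_recall_f1(tp=46, fn=43, fp=4)
print(round(p, 2), round(r, 3), round(f1, 3))
```

For this row the rounded values agree with the 0.92, 0.517 and 0.662 reported in the table.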
Features per partition | Total iterations | Reduct size | TP | FN | FP | TN | Precision | Recall | F1 | Time/s |
---|---|---|---|---|---|---|---|---|---|---|
4 | 100 | 379 | 233 | 146 | 17 | 104 | 0.932 | 0.615 | 0.741 | 117 |
4 | 200 | 286 | 195 | 91 | 55 | 159 | 0.78 | 0.682 | 0.728 | 240 |
4 | 300 | 211 | 164 | 47 | 86 | 203 | 0.656 | 0.777 | 0.711 | 354 |
6 | 60 | 289 | 236 | 53 | 14 | 197 | 0.944 | 0.817 | 0.876 | 266 |
6 | 80 | 241 | 202 | 39 | 48 | 211 | 0.808 | 0.838 | 0.823 | 354 |
6 | 100 | 199 | 156 | 43 | 94 | 207 | 0.624 | 0.784 | 0.695 | 445 |
8 | 30 | 322 | 250 | 72 | 0 | 178 | 1.000 | 0.776 | 0.874 | 430 |
8 | 40 | 280 | 234 | 46 | 16 | 204 | 0.936 | 0.836 | 0.883 | 577 |
8 | 50 | 240 | 174 | 66 | 76 | 184 | 0.696 | 0.725 | 0.710 | 716 |
Tab. 6 Performance of SP-RST on arsds2 dataset with different parameter settings
Population size | Total iterations | Reduct size | TP | FN | FP | TN | Precision | Recall | F1 | Time/s |
---|---|---|---|---|---|---|---|---|---|---|
12 | 200 | 72 | 50 | 22 | 0 | 428 | 1 | 0.694 | 0.820 | 206 |
12 | 300 | 55 | 50 | 5 | 0 | 445 | 1 | 0.909 | 0.952 | 295 |
12 | 400 | 52 | 50 | 2 | 0 | 448 | 1 | 0.962 | 0.980 | 375 |
24 | 100 | 62 | 50 | 12 | 0 | 438 | 1 | 0.806 | 0.893 | 189 |
24 | 200 | 54 | 50 | 4 | 0 | 446 | 1 | 0.926 | 0.962 | 350 |
24 | 300 | 51 | 50 | 1 | 0 | 449 | 1 | 0.980 | 0.990 | 512 |
36 | 100 | 57 | 50 | 7 | 0 | 443 | 1 | 0.877 | 0.935 | 313 |
36 | 150 | 54 | 50 | 4 | 0 | 446 | 1 | 0.926 | 0.962 | 466 |
36 | 200 | 52 | 50 | 2 | 0 | 448 | 1 | 0.962 | 0.980 | 589 |
Tab. 7 Performance of SP-WOFRST on arsds1 dataset with different parameter settings
Population size | Total iterations | Reduct size | TP | FN | FP | TN | Precision | Recall | F1 | Time/s |
---|---|---|---|---|---|---|---|---|---|---|
12 | 300 | 342 | 250 | 14 | 0 | 236 | 1 | 0.947 | 0.973 | 342 |
12 | 400 | 294 | 250 | 6 | 0 | 244 | 1 | 0.977 | 0.988 | 513 |
12 | 600 | 257 | 250 | 0 | 0 | 250 | 1 | 1.000 | 1.000 | 640 |
24 | 200 | 295 | 250 | 12 | 0 | 238 | 1 | 0.954 | 0.977 | 278 |
24 | 300 | 271 | 250 | 5 | 0 | 245 | 1 | 0.980 | 0.990 | 530 |
24 | 400 | 255 | 250 | 0 | 0 | 250 | 1 | 1.000 | 1.000 | 800 |
36 | 100 | 314 | 250 | 33 | 0 | 217 | 1 | 0.883 | 0.938 | 462 |
36 | 200 | 271 | 250 | 6 | 0 | 244 | 1 | 0.977 | 0.988 | 663 |
36 | 300 | 255 | 250 | 0 | 0 | 250 | 1 | 1.000 | 1.000 | 897 |
Tab. 8 Performance of SP-WOFRST on arsds2 dataset with different parameter settings