Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (2): 475-484. DOI: 10.11772/j.issn.1001-9081.2021050957
Received: 2021-03-25
Revised: 2021-07-21
Accepted: 2021-07-21
Online: 2022-02-11
Published: 2022-02-10
Contact: Yanyan YANG
About author: LI Yiheng, born in 2001 in Linfen, Shanxi. His research interests include machine learning.
Supported by:
Yiheng LI, Chenxi DU, Yanyan YANG, Xiangyu LI
Abstract:
Most granular-computing-based feature selection algorithms ignore the class imbalance of data. To address this problem, a feature selection algorithm for class-imbalanced data that incorporates a pseudo-label strategy was proposed. First, to facilitate the study of feature selection on class-imbalanced data, the concepts of the consistency of a sample and of a dataset were redefined, and a corresponding greedy forward search algorithm for feature selection was designed. Then, a pseudo-label strategy was introduced to balance the class distribution of the data, and the learned pseudo-labels of the samples were merged into the consistency measure to construct a pseudo-label consistency for evaluating the features of class-imbalanced datasets. Finally, by keeping the pseudo-label consistency of a class-imbalanced dataset unchanged, a Pseudo-Label Consistency based Feature Selection algorithm (PLCFS) for class-imbalanced data was designed. Experimental results show that the proposed PLCFS is second only to the max-Relevance Min-Redundancy (mRMR) algorithm, and outperforms both the Relief algorithm and the Consistency-based Feature Selection (CFS) algorithm.
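The greedy forward search over a consistency measure that the abstract describes can be sketched in plain Python. This is a minimal illustration under assumed definitions (discrete feature values; the consistency of a feature subset is the fraction of samples whose label matches the majority label of the samples sharing the same values on that subset), not the paper's exact PLCFS implementation:

```python
from collections import Counter, defaultdict

def consistency(X, y, feats):
    """Fraction of samples whose label matches the majority label of
    all samples that share the same values on the features `feats`."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[f] for f in feats)].append(label)
    hits = sum(max(Counter(labels).values()) for labels in groups.values())
    return hits / len(y)

def greedy_forward_select(X, y):
    """Add features one at a time until the consistency of the selected
    subset reaches the consistency of the full feature set."""
    all_feats = list(range(len(X[0])))
    target = consistency(X, y, all_feats)
    selected = []
    while consistency(X, y, selected) < target:
        # greedily pick the feature that raises consistency the most
        best = max((f for f in all_feats if f not in selected),
                   key=lambda f: consistency(X, y, selected + [f]))
        selected.append(best)
    return selected
```

On a toy dataset where feature 0 alone determines the label, the search stops after one step; PLCFS replaces the plain consistency here with the pseudo-label consistency described in the abstract.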
Yiheng LI, Chenxi DU, Yanyan YANG, Xiangyu LI. Feature selection algorithm for imbalanced data based on pseudo-label consistency[J]. Journal of Computer Applications, 2022, 42(2): 475-484.
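One plausible reading of the pseudo-label consistency (the paper's exact definition is not reproduced on this page) is that each sample carries a (label, pseudo-label) pair and consistency is computed over those pairs; the helper `pseudo_label_consistency` below is a hypothetical sketch under that assumption:

```python
from collections import Counter, defaultdict

def pseudo_label_consistency(X, y, y_pseudo, feats):
    """Hypothetical pseudo-label consistency: group samples by their values
    on `feats` and count, per group, the majority (label, pseudo-label) pair."""
    groups = defaultdict(list)
    for row, pair in zip(X, zip(y, y_pseudo)):
        groups[tuple(row[f] for f in feats)].append(pair)
    hits = sum(max(Counter(pairs).values()) for pairs in groups.values())
    return hits / len(y)
```

Keeping this measure unchanged while shrinking the feature set is the stopping criterion the abstract attributes to PLCFS.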
No. | Dataset | I (instances) | F (features) | IR | P/% | N/% | Data type |
---|---|---|---|---|---|---|---|
D1 | arrhythmia | 452 | 279 | 9.27 | 9.73 | 90.27 | mixed |
D2 | crx | 690 | 15 | 1.25 | 44.49 | 55.51 | mixed |
D3 | glass | 214 | 9 | 2.06 | 32.71 | 67.29 | numerical |
D4 | heart | 270 | 13 | 1.25 | 44.44 | 55.56 | mixed |
D5 | segmentation | 2 308 | 19 | 6.02 | 14.25 | 85.75 | numerical |
D6 | tic-tac-toe | 958 | 9 | 1.89 | 34.66 | 65.34 | nominal |
D7 | wdbc | 569 | 30 | 1.68 | 37.26 | 62.74 | numerical |
D8 | wpbc | 198 | 33 | 3.21 | 23.74 | 76.26 | numerical |
D9 | yeast | 1 484 | 8 | 2.46 | 28.91 | 71.09 | numerical |
D10 | zoo | 101 | 16 | 19.20 | 4.95 | 95.05 | nominal |
Tab. 1 Experimental datasets
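The IR, P/% and N/% columns of Tab. 1 follow directly from the binary label vector; a small sketch (assuming label 1 marks the positive class) reproduces, for example, the zoo row (I = 101, IR = 19.20, P = 4.95%, N = 95.05%):

```python
def imbalance_stats(y):
    """Instance count, imbalance ratio, and class percentages
    for a binary label vector (1 = positive, 0 = negative)."""
    n = len(y)
    pos = sum(1 for label in y if label == 1)
    neg = n - pos
    minority, majority = min(pos, neg), max(pos, neg)
    return {
        "I": n,
        "IR": majority / minority,   # imbalance ratio: majority / minority size
        "P%": 100 * pos / n,
        "N%": 100 * neg / n,
    }
```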
Dataset | CFS | PLCFS | mRMR | Relief |
---|---|---|---|---|
D1 | 19 | 18 | 24 | 24 |
D2 | 13 | 13 | 10 | 12 |
D3 | 8 | 7 | 5 | 7 |
D4 | 12 | 11 | 10 | 10 |
D5 | 13 | 16 | 13 | 14 |
D6 | 10 | 8 | 8 | 7 |
D7 | 14 | 24 | 24 | 12 |
D8 | 15 | 16 | 17 | 11 |
D9 | 9 | 7 | 8 | 7 |
D10 | 10 | 11 | 14 | 13 |
Tab. 2 Number of features selected by the four algorithms on the 10 datasets
Dataset | micro-F1 (CFS) | micro-F1 (PLCFS) | micro-F1 (mRMR) | micro-F1 (Relief) | macro-F1 (CFS) | macro-F1 (PLCFS) | macro-F1 (mRMR) | macro-F1 (Relief) | G-Mean (CFS) | G-Mean (PLCFS) | G-Mean (mRMR) | G-Mean (Relief) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 0.555 0 | 0.541 8 | 0.659 2 | 0.550 6 | 0.100 1 | 0.073 1 | 0.237 8 | 0.102 4 | 0.725 3 | 0.715 8 | 0.796 8 | 0.722 0 |
D2 | 0.823 2 | 0.824 6 | 0.815 9 | 0.846 4 | 0.686 1 | 0.687 6 | 0.678 6 | 0.713 2 | 0.823 2 | 0.824 6 | 0.815 9 | 0.846 4 |
D3 | 0.614 3 | 0.614 3 | 0.642 2 | 0.614 3 | 0.489 1 | 0.490 9 | 0.528 9 | 0.488 6 | 0.492 6 | 0.492 3 | 0.511 4 | 0.492 6 |
D4 | 0.848 1 | 0.840 7 | 0.837 0 | 0.833 3 | 0.844 8 | 0.837 1 | 0.833 5 | 0.830 5 | 0.848 1 | 0.840 7 | 0.837 0 | 0.833 3 |
D5 | 0.781 0 | 0.776 2 | 0.783 1 | 0.773 2 | 0.697 9 | 0.665 9 | 0.695 7 | 0.683 9 | 0.864 7 | 0.862 1 | 0.866 3 | 0.859 5 |
D6 | — | 0.686 4 | 0.686 4 | 0.660 3 | — | 0.434 6 | 0.434 6 | 0.429 6 | — | 0.686 4 | 0.686 4 | 0.660 3 |
D7 | 0.949 0 | 0.949 0 | 0.949 0 | 0.933 2 | 0.942 5 | 0.941 9 | 0.943 0 | 0.925 2 | 0.949 0 | 0.949 0 | 0.949 0 | 0.933 2 |
D8 | 0.759 0 | 0.759 0 | 0.759 0 | 0.759 0 | 0.688 6 | 0.688 6 | 0.688 6 | 0.688 6 | 0.159 0 | 0.159 0 | 0.159 0 | 0.159 0 |
D9 | — | 0.476 4 | 0.477 8 | 0.477 8 | — | 0.227 9 | 0.267 9 | 0.228 8 | — | 0.669 5 | 0.669 5 | 0.669 5 |
D10 | 0.900 0 | 0.840 5 | 0.890 0 | 0.900 0 | 0.817 4 | 0.680 7 | 0.803 5 | 0.809 3 | 0.736 7 | 0.886 5 | 0.731 7 | 0.738 5 |
Tab. 3 Metric scores on the 10 datasets under the SVM classifier
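The micro-F1, macro-F1 and G-Mean columns of Tables 3-7 can be reproduced for binary tasks with plain Python; note that for single-label binary data micro-F1 coincides with accuracy. This is a generic sketch of the metrics, not code from the paper:

```python
import math

def binary_scores(y_true, y_pred):
    """micro-F1 (= accuracy for single-label data), macro-F1, and G-Mean
    for binary labels (1 = positive/minority, 0 = negative/majority)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = (tp + tn) / len(y_true)               # micro-F1 == accuracy
    macro = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2  # mean of per-class F1
    sens = tp / (tp + fn) if (tp + fn) else 0.0    # recall on positives
    spec = tn / (tn + fp) if (tn + fp) else 0.0    # recall on negatives
    g_mean = math.sqrt(sens * spec)
    return micro, macro, g_mean
```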
Dataset | micro-F1 (CFS) | micro-F1 (PLCFS) | micro-F1 (mRMR) | micro-F1 (Relief) | macro-F1 (CFS) | macro-F1 (PLCFS) | macro-F1 (mRMR) | macro-F1 (Relief) | G-Mean (CFS) | G-Mean (PLCFS) | G-Mean (mRMR) | G-Mean (Relief) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 0.537 3 | 0.526 4 | 0.665 9 | 0.546 2 | 0.120 5 | 0.130 6 | 0.278 6 | 0.182 8 | 0.714 4 | 0.707 4 | 0.802 0 | 0.719 9 |
D2 | 0.829 0 | 0.826 1 | 0.853 6 | 0.826 1 | 0.707 2 | 0.704 4 | 0.740 5 | 0.694 0 | 0.829 0 | 0.826 1 | 0.853 6 | 0.826 1 |
D3 | 0.646 8 | 0.646 8 | 0.651 5 | 0.656 1 | 0.475 7 | 0.494 7 | 0.468 0 | 0.498 8 | 0.526 3 | 0.525 8 | 0.530 9 | 0.532 2 |
D4 | 0.833 3 | 0.811 1 | 0.814 8 | 0.800 0 | 0.829 9 | 0.805 7 | 0.811 9 | 0.795 4 | 0.833 3 | 0.811 1 | 0.814 8 | 0.800 0 |
D5 | 0.822 5 | 0.826 0 | 0.829 0 | 0.819 0 | 0.701 7 | 0.704 3 | 0.703 8 | 0.695 6 | 0.891 6 | 0.893 7 | 0.896 4 | 0.889 5 |
D6 | — | 0.690 6 | 0.717 8 | 0.751 3 | — | 0.445 3 | 0.459 4 | 0.490 9 | — | 0.690 6 | 0.773 1 | 0.717 8 |
D7 | 0.947 3 | 0.940 2 | 0.952 6 | 0.947 3 | 0.940 6 | 0.937 1 | 0.947 6 | 0.940 7 | 0.947 3 | 0.943 8 | 0.952 6 | 0.947 3 |
D8 | 0.663 3 | 0.688 7 | 0.678 6 | 0.658 6 | 0.361 1 | 0.385 5 | 0.381 2 | 0.375 5 | 0.663 3 | 0.688 7 | 0.678 6 | 0.658 6 |
D9 | — | 0.425 8 | 0.455 6 | 0.423 2 | — | 0.213 3 | 0.217 3 | 0.213 1 | — | 0.630 1 | 0.652 8 | 0.628 1 |
D10 | 0.860 0 | 0.820 5 | 0.890 0 | 0.880 0 | 0.686 0 | 0.666 9 | 0.802 0 | 0.806 5 | 0.710 3 | 0.875 4 | 0.730 1 | 0.722 4 |
Tab. 4 Metric scores on the 10 datasets under the KNN classifier
Dataset | micro-F1 (CFS) | micro-F1 (PLCFS) | micro-F1 (mRMR) | micro-F1 (Relief) | macro-F1 (CFS) | macro-F1 (PLCFS) | macro-F1 (mRMR) | macro-F1 (Relief) | G-Mean (CFS) | G-Mean (PLCFS) | G-Mean (mRMR) | G-Mean (Relief) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 0.566 1 | 0.539 7 | 0.661 4 | 0.550 6 | 0.210 5 | 0.184 8 | 0.336 3 | 0.222 4 | 0.734 6 | 0.716 3 | 0.798 5 | 0.720 7 |
D2 | 0.810 1 | 0.821 7 | 0.834 8 | 0.826 4 | 0.691 9 | 0.706 1 | 0.717 9 | 0.713 2 | 0.810 1 | 0.821 7 | 0.834 8 | 0.797 1 |
D3 | 0.646 8 | 0.698 7 | 0.651 5 | 0.654 3 | 0.499 1 | 0.536 9 | 0.616 0 | 0.568 6 | 0.523 4 | 0.556 2 | 0.323 0 | 0.360 9 |
D4 | 0.814 8 | 0.818 5 | 0.814 8 | 0.813 3 | 0.807 9 | 0.812 4 | 0.809 4 | 0.807 5 | 0.814 8 | 0.818 5 | 0.814 8 | 0.814 8 |
D5 | 0.829 9 | 0.833 3 | 0.843 7 | 0.833 2 | 0.723 4 | 0.724 9 | 0.734 8 | 0.723 9 | 0.896 4 | 0.898 6 | 0.904 9 | 0.894 7 |
D6 | — | 0.593 2 | 0.498 4 | 0.560 3 | — | 0.362 6 | 0.322 2 | 0.329 6 | — | 0.593 2 | 0.498 4 | 0.745 0 |
D7 | 0.935 0 | 0.943 8 | 0.938 5 | 0.935 0 | 0.926 2 | 0.937 3 | 0.930 5 | 0.924 2 | 0.935 0 | 0.943 8 | 0.938 5 | 0.938 5 |
D8 | 0.633 2 | 0.648 5 | 0.658 8 | 0.641 1 | 0.363 9 | 0.357 4 | 0.378 7 | 0.369 3 | 0.633 2 | 0.648 5 | 0.658 8 | 0.593 6 |
D9 | — | 0.444 1 | 0.444 1 | 0.444 1 | — | 0.279 4 | 0.315 0 | 0.287 3 | — | 0.643 8 | 0.643 8 | 0.641 7 |
D10 | 0.940 0 | 0.880 5 | 0.920 0 | 0.900 0 | 0.857 4 | 0.650 0 | 0.789 8 | 0.755 5 | 0.763 4 | 0.919 7 | 0.749 5 | 0.762 5 |
Tab. 5 Metric scores on the 10 datasets under the RF classifier
Dataset | micro-F1 (CFS) | micro-F1 (PLCFS) | micro-F1 (mRMR) | micro-F1 (Relief) | macro-F1 (CFS) | macro-F1 (PLCFS) | macro-F1 (mRMR) | macro-F1 (Relief) | G-Mean (CFS) | G-Mean (PLCFS) | G-Mean (mRMR) | G-Mean (Relief) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 0.416 0 | 0.433 5 | 0.661 5 | 0.444 5 | 0.174 1 | 0.205 7 | 0.314 1 | 0.203 1 | 0.624 7 | 0.640 3 | 0.798 8 | 0.648 6 |
D2 | 0.791 3 | 0.811 6 | 0.821 7 | 0.800 0 | 0.691 9 | 0.718 7 | 0.696 2 | 0.687 5 | 0.791 3 | 0.811 6 | 0.821 7 | 0.800 0 |
D3 | 0.660 9 | 0.660 9 | 0.656 3 | 0.660 9 | 0.635 6 | 0.635 6 | 0.631 5 | 0.635 6 | 0.333 5 | 0.333 5 | 0.329 0 | 0.333 5 |
D4 | 0.751 9 | 0.785 2 | 0.807 4 | 0.781 5 | 0.747 0 | 0.781 6 | 0.804 4 | 0.778 2 | 0.751 9 | 0.785 2 | 0.807 4 | 0.781 5 |
D5 | 0.830 7 | 0.829 9 | 0.833 3 | 0.825 5 | 0.721 0 | 0.702 2 | 0.724 2 | 0.716 0 | 0.897 2 | 0.897 4 | 0.898 7 | 0.893 7 |
D6 | — | 0.565 5 | 0.557 1 | 0.758 6 | — | 0.377 4 | 0.375 2 | 0.491 3 | — | 0.565 5 | 0.557 1 | 0.758 6 |
D7 | 0.922 6 | 0.919 2 | 0.913 9 | 0.936 7 | 0.913 2 | 0.910 6 | 0.903 9 | 0.928 5 | 0.922 6 | 0.919 2 | 0.913 9 | 0.936 7 |
D8 | 0.533 2 | 0.542 9 | 0.573 1 | 0.517 8 | 0.339 0 | 0.328 5 | 0.336 2 | 0.315 6 | 0.533 2 | 0.542 9 | 0.573 1 | 0.517 8 |
D9 | — | 0.450 2 | 0.452 2 | 0.447 5 | — | 0.253 8 | 0.310 0 | 0.213 9 | — | 0.648 7 | 0.650 2 | 0.646 6 |
D10 | 0.960 0 | 0.890 5 | 0.930 0 | 0.960 0 | 0.924 4 | 0.700 0 | 0.862 9 | 0.894 5 | 0.774 7 | 0.923 7 | 0.753 5 | 0.775 5 |
Tab. 6 Metric scores on the 10 datasets under the DT classifier
Dataset | micro-F1 (CFS) | micro-F1 (PLCFS) | micro-F1 (mRMR) | micro-F1 (Relief) | macro-F1 (CFS) | macro-F1 (PLCFS) | macro-F1 (mRMR) | macro-F1 (Relief) | G-Mean (CFS) | G-Mean (PLCFS) | G-Mean (mRMR) | G-Mean (Relief) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 0.577 3 | 0.575 1 | 0.661 6 | 0.581 7 | 0.191 8 | 0.194 5 | 0.279 9 | 0.246 4 | 0.742 8 | 0.739 9 | 0.799 0 | 0.744 8 |
D2 | 0.839 1 | 0.837 7 | 0.843 5 | 0.834 8 | 0.719 2 | 0.717 6 | 0.711 5 | 0.698 0 | 0.839 1 | 0.837 7 | 0.843 5 | 0.834 8 |
D3 | 0.637 4 | 0.642 2 | 0.642 1 | 0.646 8 | 0.456 1 | 0.472 7 | 0.472 5 | 0.465 5 | 0.515 2 | 0.521 4 | 0.515 9 | 0.526 1 |
D4 | 0.829 6 | 0.829 6 | 0.844 4 | 0.833 3 | 0.826 3 | 0.826 0 | 0.841 7 | 0.830 0 | 0.829 6 | 0.829 6 | 0.844 4 | 0.833 3 |
D5 | 0.801 7 | 0.801 7 | 0.826 0 | 0.819 5 | 0.689 0 | 0.682 4 | 0.703 6 | 0.697 3 | 0.878 9 | 0.877 1 | 0.894 4 | 0.890 3 |
D6 | — | 0.595 6 | 0.595 6 | 0.553 8 | — | 0.395 5 | 0.395 5 | 0.366 0 | — | 0.595 6 | 0.595 6 | 0.553 8 |
D7 | 0.945 5 | 0.950 8 | 0.943 8 | 0.943 7 | 0.938 7 | 0.944 8 | 0.936 5 | 0.936 3 | 0.945 5 | 0.950 8 | 0.943 8 | 0.943 7 |
D8 | 0.693 3 | 0.653 3 | 0.633 3 | 0.703 7 | 0.369 6 | 0.370 9 | 0.352 5 | 0.373 5 | 0.693 3 | 0.653 3 | 0.633 3 | 0.703 7 |
D9 | — | 0.475 8 | 0.480 5 | 0.476 4 | — | 0.235 0 | 0.299 8 | 0.235 7 | — | 0.668 0 | 0.671 5 | 0.668 5 |
D10 | 0.930 0 | 0.860 5 | 0.930 0 | 0.930 0 | 0.852 4 | 0.711 3 | 0.847 9 | 0.871 0 | 0.757 0 | 0.905 2 | 0.757 0 | 0.755 9 |
Tab. 7 Metric scores on the 10 datasets under the LR classifier
1 | LI Y X, CHAI Y, HU Y Q, et al. Review of imbalanced data classification methods[J]. Control and Decision, 2019, 34(4): 673-688. 10.13195/j.kzyjc.2018.0865 |
2 | HE H B, GARCIA E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284. 10.1109/tkde.2008.239 |
3 | JING X Y, ZHANG X Y, ZHU X K, et al. Multiset feature learning for highly imbalanced data classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(1): 139-156. 10.1109/tpami.2019.2929166 |
4 | KHORSHIDI H A, AICKELIN U. Constructing classifiers for imbalanced data using diversity optimization[J]. Information Sciences, 2021, 565: 1-16. 10.1016/j.ins.2021.02.069 |
5 | FU Y G, HUANG H Y, GUAN Y, et al. EBRB cascade classifier for imbalanced data via rule weight updating[J]. Knowledge-Based Systems, 2021, 223: No.107010. 10.1016/j.knosys.2021.107010 |
6 | ZHENG Z H, WU X Y, SRIHARI R. Feature selection for text categorization on imbalanced data[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 80-89. 10.1145/1007730.1007741 |
7 | VAN HULSE J, KHOSHGOFTAAR T M, NAPOLITANO A, et al. Feature selection with high-dimensional imbalanced data[C]// Proceedings of the 2009 IEEE International Conference on Data Mining Workshops. Piscataway: IEEE, 2009: 507-514. 10.1109/icdmw.2009.35 |
8 | WASIKOWSKI M, CHEN X W. Combating the small sample class imbalance problem using feature selection[J]. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1388-1400. 10.1109/tkde.2009.187 |
9 | MALDONADO S, WEBER R, FAMILI F. Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines[J]. Information Sciences, 2014, 286: 228-246. 10.1016/j.ins.2014.07.015 |
10 | YIN L Z, GE Y, XIAO K L, et al. Feature selection for high-dimensional imbalanced data[J]. Neurocomputing, 2013, 105: 3-11. 10.1016/j.neucom.2012.04.039 |
11 | FU G H, WU Y J, ZONG M J, et al. Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics[J]. Chemometrics and Intelligent Laboratory Systems, 2020, 196: No.103906. 10.1016/j.chemolab.2019.103906 |
12 | PEDRYCZ W. Granular Computing: Analysis and Design of Intelligent Systems[M]. Boca Raton: CRC Press, 2013: 15-36. 10.1201/b14862 |
13 | YAO Y Y. Three-way granular computing, rough sets, and formal concept analysis[J]. International Journal of Approximate Reasoning, 2020, 116: 106-125. 10.1016/j.ijar.2019.11.002 |
14 | LIU J F, HU Q H, YU D R. A weighted rough set based method developed for class imbalance learning[J]. Information Sciences, 2008, 178(4): 1235-1256. 10.1016/j.ins.2007.10.002 |
15 | ZHOU P, HU X G, LI P P, et al. Online feature selection for high-dimensional class-imbalanced data[J]. Knowledge-Based Systems, 2017, 136: 187-199. 10.1016/j.knosys.2017.09.006 |
16 | CHEN H M, LI T R, FAN X, et al. Feature selection for imbalanced data based on neighborhood rough sets[J]. Information Sciences, 2019, 483: 1-20. 10.1016/j.ins.2019.01.041 |
17 | XUE J H, HALL P. Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis?[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(5): 1109-1112. 10.1109/tpami.2014.2359660 |
18 | YANG Y Z, XU Z. Rethinking the value of labels for improving class-imbalanced learning[C/OL]// Proceedings of the 34th Conference on Neural Information Processing Systems. [2021-03-28]. |
19 | YANG X B, LIANG S C, YU H L, et al. Pseudo-label neighborhood rough set: measures and attribute reductions[J]. International Journal of Approximate Reasoning, 2019, 105: 112-129. 10.1016/j.ijar.2018.11.010 |
20 | ZENG W R, CHEN X W, CHENG H. Pseudo labels for imbalanced multi-label learning [C]// Proceedings of the 2014 International Conference on Data Science and Advanced Analytics. Piscataway: IEEE, 2014: 25-31. 10.1109/dsaa.2014.7058047 |
21 | MIAO D Q, ZHAO Y, YAO Y Y, et al. Relative reducts in consistent and inconsistent decision tables of the Pawlak rough set model[J]. Information Sciences, 2009, 179(24): 4140-4150. 10.1016/j.ins.2009.08.020 |
22 | YANG Y Y, CHEN D G, WANG H. Active sample selection based incremental algorithm for attribute reduction with rough sets[J]. IEEE Transactions on Fuzzy Systems, 2017, 25(4): 825-838. 10.1109/tfuzz.2016.2581186 |
23 | PENG H C, LONG F H, DING C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238. 10.1109/tpami.2005.159 |
24 | KONONENKO I. Estimating attributes: analysis and extensions of RELIEF [C]// Proceedings of the 1994 European Conference on Machine Learning, LNCS784. Berlin: Springer, 1994: 171-182. |