Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3750-3755.DOI: 10.11772/j.issn.1001-9081.2021101837
Special Issue: Data science and technology
Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG
Received: 2021-10-28
Revised: 2022-01-06
Accepted: 2022-01-10
Online: 2022-01-19
Published: 2022-12-10
Contact: Zhen JIANG
About author: LU Yu, born in 1997, a native of Xuzhou, Jiangsu, M. S. candidate. His research interests include machine learning.
Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG. Imbalanced classification algorithm based on improved semi-supervised clustering[J]. Journal of Computer Applications, 2022, 42(12): 3750-3755.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021101837
Dataset | Samples | Features | Classes | IR |
---|---|---|---|---|
ecoli-0-4-6vs5 | 203 | 6 | 1∶-1 | 9.15 |
dermatology-6 | 358 | 34 | 1∶-1 | 16.90 |
pima | 768 | 8 | 1∶-1 | 1.87 |
yeast5 | 1,484 | 8 | 1∶-1 | 32.73 |
abalone19 | 4,174 | 8 | 1∶-1 | 129.44 |
pageblock1 | 5,473 | 10 | 1∶others | 15.60 |
pageblock2 | 5,473 | 10 | 2∶others | 8.77 |
HTRU2 | 17,898 | 10 | 1∶-1 | 9.92 |
letter4 | 15,000 | 16 | 4∶others | 24.08 |
ijcnn1 | 49,990 | 22 | 1∶-1 | 9.30 |
Tab. 1 Basic information of datasets
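The IR column is not defined on this page. A minimal sketch, assuming the usual convention that IR is the imbalance ratio (majority-class count divided by minority-class count), is given below. The function name `imbalance_ratio` and the class counts 183 and 20 are illustrative assumptions; the counts are simply the integers that reproduce the 203 samples and IR of 9.15 listed for ecoli-0-4-6vs5.

```python
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Return the majority-to-minority sample count ratio (assumed IR definition)."""
    if n_minority <= 0:
        raise ValueError("the minority class must contain at least one sample")
    return n_majority / n_minority


# Assumed split for ecoli-0-4-6vs5: 183 majority + 20 minority = 203 samples.
ir = imbalance_ratio(183, 20)
print(round(ir, 2))  # 9.15
```

Under this reading, abalone19 (IR 129.44) is by far the most skewed dataset in the table and pima (IR 1.87) the least.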
Dataset | B-SMOTE | SVM-SMOTE | K-Means SMOTE | ADASYN | RCSMOTE | TU | CDSMOTE | CS-K-Means | SVM | C_SVM | Proposed algorithm |
---|---|---|---|---|---|---|---|---|---|---|---|
Average | 0.8738 | 0.8757 | 0.8427 | 0.8957 | 0.9164 | 0.8822 | 0.9059 | 0.8339 | 0.8301 | 0.9323 | 0.9414 |
ecoli-0-4-6vs5 | 0.9346 | 0.9262 | 0.8973 | 0.9027 | 0.9152 | 0.9096 | 0.9152 | 0.8477 | 0.9235 | 0.9721 | 0.9807 |
dermatology-6 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
pima | 0.7288 | 0.7160 | 0.7265 | 0.6955 | 0.7563 | 0.7353 | 0.7520 | 0.7346 | 0.7178 | 0.7523 | 0.7714 |
yeast5 | 0.7351 | 0.7497 | 0.7597 | 0.9622 | 0.9872 | 0.8436 | 0.9871 | 0.7843 | 0.6023 | 0.9869 | 0.9744 |
abalone19 | 0.6495 | 0.6414 | 0.5979 | 0.7238 | 0.7162 | 0.5115 | 0.7581 | 0.5211 | 0.5254 | 0.8085 | 0.8048 |
pageblock1 | 0.9535 | 0.9618 | 0.8541 | 0.9537 | 0.9299 | 0.9473 | 0.9638 | 0.9247 | 0.9174 | 0.9596 | 0.9631 |
pageblock2 | 0.9230 | 0.9418 | 0.8508 | 0.9200 | 0.9787 | 0.9873 | 0.8869 | 0.8591 | 0.9082 | 0.9644 | 0.9713 |
HTRU2 | 0.9189 | 0.9229 | 0.8938 | 0.9047 | 0.9656 | 0.9517 | 0.9323 | 0.9556 | 0.9053 | 0.9307 | 0.9708 |
letter4 | 0.9298 | 0.9317 | 0.9263 | 0.9404 | 0.9526 | 0.9572 | 0.9113 | 0.9009 | 0.9004 | 0.9874 | 0.9956 |
ijcnn1 | 0.9645 | 0.9657 | 0.9206 | 0.9537 | 0.9623 | 0.9781 | 0.9521 | 0.8114 | 0.9011 | 0.9610 | 0.9817 |
Tab. 2 AUC comparison of different algorithms
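For readers unfamiliar with the metric in Tab. 2, the sketch below shows one standard way to compute AUC: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The function name `auc_score` and the toy labels and scores are illustrative assumptions, not the evaluation code or data behind the table.

```python
def auc_score(labels, scores, positive=1):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 minority (label 1) and 3 majority (label -1) samples.
labels = [1, -1, -1, 1, -1]
scores = [0.9, 0.4, 0.35, 0.3, 0.1]
print(auc_score(labels, scores))  # 4 of 6 pairs are ranked correctly
```

Because the measure depends only on how the two classes are ranked against each other, it is not inflated by the size of the majority class, which is why it is commonly reported for imbalanced data.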
Dataset | B-SMOTE | SVM-SMOTE | K-Means SMOTE | ADASYN | RCSMOTE | TU | CDSMOTE | CS-K-Means | SVM | C_SVM | Proposed algorithm |
---|---|---|---|---|---|---|---|---|---|---|---|
Average | 0.8518 | 0.8528 | 0.8215 | 0.8796 | 0.8378 | 0.8346 | 0.8798 | 0.7495 | 0.7910 | 0.8943 | 0.8981 |
ecoli-0-4-6vs5 | 0.9306 | 0.9204 | 0.8973 | 0.8930 | 0.8853 | 0.9040 | 0.8587 | 0.8278 | 0.8881 | 0.9121 | 0.9169 |
dermatology-6 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
pima | 0.6915 | 0.6907 | 0.7097 | 0.6653 | 0.7050 | 0.7021 | 0.7539 | 0.6933 | 0.8357 | 0.7056 | 0.6939 |
yeast5 | 0.6736 | 0.6847 | 0.7052 | 0.9617 | 0.5964 | 0.8275 | 0.8112 | 0.8076 | 0.7223 | 0.9333 | 0.9471 |
abalone19 | 0.5817 | 0.5687 | 0.4973 | 0.6585 | 0.6122 | 0.2156 | 0.7674 | 0.3125 | 0.0000 | 0.6602 | 0.6593 |
pageblock1 | 0.9160 | 0.9017 | 0.8391 | 0.9090 | 0.8901 | 0.9355 | 0.9437 | 0.8173 | 0.8953 | 0.9492 | 0.9501 |
pageblock2 | 0.8610 | 0.9021 | 0.7833 | 0.8600 | 0.8836 | 0.9348 | 0.8847 | 0.6626 | 0.8425 | 0.9239 | 0.9255 |
HTRU2 | 0.9189 | 0.9228 | 0.8883 | 0.9042 | 0.9035 | 0.8969 | 0.9308 | 0.8982 | 0.8929 | 0.9290 | 0.9315 |
letter4 | 0.9797 | 0.9717 | 0.9762 | 0.9803 | 0.9456 | 0.9563 | 0.9398 | 0.8526 | 0.9162 | 0.9724 | 0.9812 |
ijcnn1 | 0.9645 | 0.9656 | 0.9187 | 0.9644 | 0.9562 | 0.9735 | 0.9075 | 0.6233 | 0.9171 | 0.9571 | 0.9752 |
Tab. 3 G-mean comparison of different algorithms
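The G-mean values in Tab. 3 can be read with the standard definition in mind: the geometric mean of the true positive rate and the true negative rate. The sketch below assumes that definition; the function name `g_mean` and the toy predictions are illustrative only. It also shows one way a score of 0.0000, such as the SVM entry for abalone19, can arise: a classifier that labels every sample as the majority class has a true positive rate of zero.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)


y_true = [1, 1, -1, -1, -1, -1]

# Predicting the majority class everywhere: TNR is 1 but TPR is 0.
print(g_mean(y_true, [-1, -1, -1, -1, -1, -1]))  # 0.0

# One of two positives found, one false alarm: sqrt(0.5 * 0.75).
print(round(g_mean(y_true, [1, -1, -1, -1, -1, 1]), 4))  # 0.6124
```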