Imbalanced classification algorithm based on improved semi-supervised clustering

doi:10.11772/j.issn.1001-9081.2021101837

Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3750-3755. DOI: 10.11772/j.issn.1001-9081.2021101837

• Data science and technology •

Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG

College of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China

Received: 2021-10-28 Revised: 2022-01-06 Accepted: 2022-01-10 Online: 2022-01-19 Published: 2022-12-10
Contact: Zhen JIANG
About author: LU Yu, born in 1997 in Xuzhou, Jiangsu, M. S. candidate. His research interests include machine learning.
ZHAO Lingyun, born in 1996 in Huai'an, Jiangsu, M. S. candidate. His research interests include machine learning.
BAI Binwen, born in 2001 in Taiyuan, Shanxi. His research interests include machine learning.
Supported by:
National Natural Science Foundation of China (61906077); Practical Innovation Training Program for College Students of Jiangsu University (202010299312X)

Abstract:

Imbalanced classification is one of the research hotspots in machine learning. Oversampling, one of its main techniques, increases the number of minority-class samples through repeated sampling or artificial synthesis to rebalance the dataset. However, most existing oversampling methods operate on the original sample distribution and therefore struggle to reveal further distribution characteristics of the dataset. To address this problem, first, an improved semi-supervised clustering algorithm is proposed to mine the distribution characteristics of the data. Second, based on the semi-supervised clustering results, high-confidence unlabeled data (pseudo-labeled samples) are selected from the minority-class clusters and added to the original training set. Besides rebalancing the dataset, this allows the distribution characteristics obtained by semi-supervised clustering to assist the imbalanced classification. Finally, the results of semi-supervised clustering and classification are fused to predict the final class labels, which further improves imbalanced classification performance. With G-mean and Area Under Curve (AUC) as evaluation metrics, the proposed algorithm was compared on 10 public datasets with seven oversampling- or undersampling-based imbalanced classification algorithms, including TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique). Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm improves the average AUC by 6.7% and 3.9% respectively and the average G-mean by 7.6% and 2.1% respectively, and that it achieves the highest average result among all compared algorithms on both metrics. The proposed algorithm can therefore effectively improve imbalanced classification performance.

Key words: imbalanced classification, semi-supervised clustering, pseudo-labeled sample, oversampling, fusion
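
The first two steps described in the abstract (cluster with partial labels, then move confident unlabeled points from the minority cluster into the training set) can be sketched as follows. The abstract does not specify the clustering procedure or the confidence measure, so this sketch assumes a two-class seeded k-means and a distance-ratio confidence. The names `seeded_kmeans` and `select_pseudo_labeled`, the toy data and `k=2` are hypothetical, and the final fusion step is not shown.

```python
import math

def centroid(rows):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def seeded_kmeans(points, labels, n_iter=20):
    """Two-class seeded k-means (an assumed stand-in for the paper's clustering).

    labels[i] is +1 (minority), -1 (majority) or None (unlabeled).
    Labeled points stay in their class; unlabeled points join the nearest centroid.
    Each class needs at least one labeled point to seed its centroid.
    """
    classes = (1, -1)
    centroids = {c: centroid([p for p, y in zip(points, labels) if y == c])
                 for c in classes}
    assign = list(labels)
    for _ in range(n_iter):
        assign = [y if y is not None
                  else min(classes, key=lambda c: math.dist(p, centroids[c]))
                  for p, y in zip(points, labels)]
        centroids = {c: centroid([p for p, a in zip(points, assign) if a == c])
                     for c in classes}
    return assign, centroids

def select_pseudo_labeled(points, labels, assign, centroids, k):
    """Indices of the k most confident unlabeled points in the minority cluster.

    Confidence (an assumption) is d_majority / (d_minority + d_majority):
    the closer a point is to the minority centroid, the nearer the value is to 1.
    """
    scored = []
    for i, (p, y, a) in enumerate(zip(points, labels, assign)):
        if y is None and a == 1:
            d_min = math.dist(p, centroids[1])
            d_maj = math.dist(p, centroids[-1])
            total = d_min + d_maj
            scored.append((d_maj / total if total > 0 else 0.5, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy data: four labeled majority points, one labeled minority point, four unlabeled.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (3.0, 3.0),
          (2.8, 3.1), (2.9, 2.7), (1.9, 2.0), (0.1, 0.1)]
labels = [-1, -1, -1, -1, 1, None, None, None, None]

assign, centroids = seeded_kmeans(points, labels)
selected = select_pseudo_labeled(points, labels, assign, centroids, k=2)

# Rebalanced training set: original labeled data plus pseudo-labeled minority points.
train = [(p, y) for p, y in zip(points, labels) if y is not None]
train += [(points[i], 1) for i in selected]
print("pseudo-labeled indices:", selected)
print("minority samples in training set:", sum(1 for _, y in train if y == 1))
```

In this toy run the borderline point `(1.9, 2.0)` falls in the minority cluster but is not selected, because its confidence is lower than that of the two points closest to the minority centroid.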

CLC number: TP181

Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG. Imbalanced classification algorithm based on improved semi-supervised clustering[J]. Journal of Computer Applications, 2022, 42(12): 3750-3755.

Figures/Tables 4

Dataset	Samples	Features	Classes	IR
ecoli-0-4-6vs5	203	6	1:-1	9.15
dermatology-6	358	34	1:-1	16.90
pima	768	8	1:-1	1.87
yeast5	1484	8	1:-1	32.73
abalone19	4174	8	1:-1	129.44
pageblock1	5473	10	1:others	15.60
pageblock2	5473	10	2:others	8.77
HTRU2	17898	10	1:-1	9.92
letter4	15000	16	4:others	24.08
ijcnn1	49990	22	1:-1	9.30
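
The IR column can be related back to class sizes with a short calculation. The sketch below assumes the usual definition IR = majority count / minority count, which the page itself does not state; `class_counts` is a hypothetical helper, and the recovered counts are estimates rounded to whole samples.

```python
def class_counts(n_samples, ir):
    """Estimate (minority, majority) sizes from the sample count and IR.

    Assumes IR = majority / minority, so minority = n / (1 + IR).
    """
    minority = round(n_samples / (1 + ir))
    return minority, n_samples - minority

# (samples, IR) pairs copied from three rows of the dataset table.
datasets = {
    "ecoli-0-4-6vs5": (203, 9.15),
    "pima": (768, 1.87),
    "abalone19": (4174, 129.44),
}

for name, (n, ir) in datasets.items():
    minority, majority = class_counts(n, ir)
    print(f"{name}: {minority} minority, {majority} majority, "
          f"IR ~ {majority / minority:.2f}")
```

Under this assumption abalone19 has only about 32 minority samples out of 4174, which is why it is the hardest dataset in the table.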

Dataset	B-SMOTE	SVM-SMOTE	K-Means SMOTE	ADASYN	RCSMOTE	TU	CDSMOTE	CS-K-Means	SVM	C_SVM	Proposed
Average	0.8738	0.8757	0.8427	0.8957	0.9164	0.8822	0.9059	0.8339	0.8301	0.9323	0.9414
ecoli-0-4-6vs5	0.9346	0.9262	0.8973	0.9027	0.9152	0.9096	0.9152	0.8477	0.9235	0.9721	0.9807
dermatology-6	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
pima	0.7288	0.7160	0.7265	0.6955	0.7563	0.7353	0.7520	0.7346	0.7178	0.7523	0.7714
yeast5	0.7351	0.7497	0.7597	0.9622	0.9872	0.8436	0.9871	0.7843	0.6023	0.9869	0.9744
abalone19	0.6495	0.6414	0.5979	0.7238	0.7162	0.5115	0.7581	0.5211	0.5254	0.8085	0.8048
pageblock1	0.9535	0.9618	0.8541	0.9537	0.9299	0.9473	0.9638	0.9247	0.9174	0.9596	0.9631
pageblock2	0.9230	0.9418	0.8508	0.9200	0.9787	0.9873	0.8869	0.8591	0.9082	0.9644	0.9713
HTRU2	0.9189	0.9229	0.8938	0.9047	0.9656	0.9517	0.9323	0.9556	0.9053	0.9307	0.9708
letter4	0.9298	0.9317	0.9263	0.9404	0.9526	0.9572	0.9113	0.9009	0.9004	0.9874	0.9956
ijcnn1	0.9645	0.9657	0.9206	0.9537	0.9623	0.9781	0.9521	0.8114	0.9011	0.9610	0.9817
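
The table above carries no caption on this page. Its averages match the AUC figures quoted in the abstract if the reported improvements are read as relative gains, so the check below assumes that this is the AUC table. `relative_gain` is a hypothetical helper, and the numbers are copied from the table.

```python
# Last column of the table (per-dataset scores of the proposed algorithm).
proposed = [0.9807, 1.0000, 0.7714, 0.9744, 0.8048,
            0.9631, 0.9713, 0.9708, 0.9956, 0.9817]
avg_proposed = sum(proposed) / len(proposed)

# Averages copied from the first data row of the table.
avg = {"TU": 0.8822, "CDSMOTE": 0.9059, "Proposed": 0.9414}

def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1) * 100

gains = {name: round(relative_gain(avg["Proposed"], avg[name]), 1)
         for name in ("TU", "CDSMOTE")}

print(round(avg_proposed, 4))  # column mean, to compare with the Average row
print(gains)                   # to compare with the 6.7% and 3.9% in the abstract
```

The column mean reproduces the value in the Average row, and the relative gains reproduce the percentages stated in the abstract, which supports reading both as relative improvements of the average score.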

Dataset	B-SMOTE	SVM-SMOTE	K-Means SMOTE	ADASYN	RCSMOTE	TU	CDSMOTE	CS-K-Means	SVM	C_SVM	Proposed
Average	0.8518	0.8528	0.8215	0.8796	0.8378	0.8346	0.8798	0.7495	0.7910	0.8943	0.8981
ecoli-0-4-6vs5	0.9306	0.9204	0.8973	0.8930	0.8853	0.9040	0.8587	0.8278	0.8881	0.9121	0.9169
dermatology-6	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
pima	0.6915	0.6907	0.7097	0.6653	0.7050	0.7021	0.7539	0.6933	0.8357	0.7056	0.6939
yeast5	0.6736	0.6847	0.7052	0.9617	0.5964	0.8275	0.8112	0.8076	0.7223	0.9333	0.9471
abalone19	0.5817	0.5687	0.4973	0.6585	0.6122	0.2156	0.7674	0.3125	0.0000	0.6602	0.6593
pageblock1	0.9160	0.9017	0.8391	0.9090	0.8901	0.9355	0.9437	0.8173	0.8953	0.9492	0.9501
pageblock2	0.8610	0.9021	0.7833	0.8600	0.8836	0.9348	0.8847	0.6626	0.8425	0.9239	0.9255
HTRU2	0.9189	0.9228	0.8883	0.9042	0.9035	0.8969	0.9308	0.8982	0.8929	0.9290	0.9315
letter4	0.9797	0.9717	0.9762	0.9803	0.9456	0.9563	0.9398	0.8526	0.9162	0.9724	0.9812
ijcnn1	0.9645	0.9656	0.9187	0.9644	0.9562	0.9735	0.9075	0.6233	0.9171	0.9571	0.9752
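
This second results table is also uncaptioned here. Its averages match the G-mean gains quoted in the abstract (7.6% over TU, 2.1% over CDSMOTE, read as relative gains), so it is assumed to report G-mean. The sketch below uses the standard definition, the square root of sensitivity times specificity, and shows why a classifier that ignores the minority class scores 0, as the SVM entry for abalone19 does. The confusion counts are hypothetical, sized like abalone19 under the assumption IR = majority / minority.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity (majority recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical classifier that predicts "majority" for every sample
# (about 32 minority and 4142 majority samples, as estimated for abalone19).
all_majority = dict(tp=0, fn=32, tn=4142, fp=0)
accuracy = (all_majority["tp"] + all_majority["tn"]) / 4174
print(round(accuracy, 4), g_mean(**all_majority))  # high accuracy, G-mean of 0

# Hypothetical classifier that trades some majority accuracy for minority recall.
balanced = dict(tp=24, fn=8, tn=3314, fp=828)
print(round(g_mean(**balanced), 4))

# Relative gains of the proposed algorithm, from the Average row of the table.
gain_tu = round((0.8981 / 0.8346 - 1) * 100, 1)
gain_cdsmote = round((0.8981 / 0.8798 - 1) * 100, 1)
print(gain_tu, gain_cdsmote)
```

Because G-mean multiplies the two per-class recalls, it cannot be inflated by the majority class alone, which makes it a natural companion to AUC for highly imbalanced datasets.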
