Fine-tuned and filtered oversampling method based on agglomerative hierarchical clustering

doi:10.11772/j.issn.1001-9081.2024070919

Journal of Computer Applications, 2025, Vol. 45, Issue (7): 2138-2144.

The 39th CCF National Conference of Computer Applications (CCF NCCA 2024)

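Zheng GU¹,²,³, Xuebin CHEN¹,²,³, Hongyang ZHANG¹,²,³, Yuxin LI¹,²,³

¹ College of Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
² Hebei Provincial Key Laboratory of Data Science and Application (North China University of Science and Technology), Tangshan, Hebei 063210, China
³ Tangshan Key Laboratory of Data Science (North China University of Science and Technology), Tangshan, Hebei 063210, China

Received: 2024-07-03 Revised: 2024-09-25 Accepted: 2024-09-29 Online: 2025-07-10 Published: 2025-07-10
Corresponding author: Xuebin CHEN
About the authors: GU Zheng (born 1999), female, from Langfang, Hebei, M.S. candidate, CCF member. Her research interests include data analysis.
ZHANG Hongyang (born 1999), male, from Huai'an, Jiangsu, M.S. candidate, CCF member. His research interests include data security and privacy protection.
LI Yuxin (born 2000), female, from Linfen, Shanxi, M.S. candidate, CCF member. Her research interests include data analysis.
Supported by:
National Natural Science Foundation of China (U20A20179)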

Abstract

To address the poor classification performance on imbalanced datasets, a fine-tuned and filtered oversampling method based on Agglomerative Hierarchical Clustering (AHC) was proposed; the method is applicable to multi-class imbalanced data. Firstly, the AHC algorithm was applied to cluster the majority and minority classes separately, so that inter-class relationships were considered while class overlap was effectively avoided. Secondly, to balance the dataset while preserving the characteristics of the original data, a fine-tuned oversampling algorithm was designed. Thirdly, to improve the classification accuracy of the generated samples, a label tendency evaluation and filtering method based on propensity score matching was proposed. Finally, the proposed method was validated experimentally and compared with three methods: MDO (Mahalanobis Distance-based Over-sampling technique), AND-SMOTE (Automatic Neighborhood size Determination method for Synthetic Minority Over-sampling TEchnique), and K-means SMOTE. Experimental results show that the proposed method achieves good performance on six datasets, including Abalone, Contraceptive and Yeast, which verifies its effectiveness.

Key words: imbalanced data, multi-class classification, oversampling, Agglomerative Hierarchical Clustering (AHC), label tendency evaluation
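The abstract names three stages (per-class AHC, oversampling, filtering) without giving their details, so the following is only a minimal sketch of that overall shape, not the authors' algorithm. Everything specific in it is an assumption made for illustration: average linkage, SMOTE-style interpolation inside a cluster in place of the paper's fine-tuning rule, and a nearest-neighbour label vote in place of the propensity-score-based filter. The function names (`average_linkage_clusters`, `oversample_class`, `keep_if_label_agrees`, `balance`) are hypothetical.

```python
import math
import random


def average_linkage_clusters(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters (assumed linkage).
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def oversample_class(points, n_new, n_clusters, rng):
    """Create n_new candidates by interpolating between two members of one cluster."""
    if n_new <= 0:
        return []
    clusters = average_linkage_clusters(points, min(n_clusters, len(points)))
    synthetic = []
    for _ in range(n_new):
        members = rng.choice(clusters)
        a = points[rng.choice(members)]
        b = points[rng.choice(members)]
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def keep_if_label_agrees(candidate, label, X, y, k=3):
    """Stand-in filter: keep a candidate if most of its k nearest originals share its label."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(candidate, X[i]))[:k]
    votes = sum(1 for i in nearest if y[i] == label)
    return votes * 2 > k


def balance(X, y, n_clusters=2, seed=0):
    """Cluster each class separately, oversample up to the largest class, then filter."""
    rng = random.Random(seed)
    by_class = {}
    for point, label in zip(X, y):
        by_class.setdefault(label, []).append(point)
    target = max(len(pts) for pts in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, pts in by_class.items():
        candidates = oversample_class(pts, target - len(pts), n_clusters, rng)
        kept = [c for c in candidates if keep_if_label_agrees(c, label, X, y)]
        X_out.extend(kept)
        y_out.extend([label] * len(kept))
    return X_out, y_out
```

Because candidates are interpolated only between points of the same cluster of the same class, they stay inside regions that class already occupies, which is one plausible reading of how per-class clustering helps limit class overlap. The filter can reject candidates, so a class may end up slightly below the target size.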

CLC Number: TP301.6

Cite this article: Zheng GU, Xuebin CHEN, Hongyang ZHANG, Yuxin LI. Fine-tuned and filtered oversampling method based on agglomerative hierarchical clustering[J]. Journal of Computer Applications, 2025, 45(7): 2138-2144.

Figures/Tables: 6

Class	Predicted positive	Predicted negative
Actual positive	TP	FN
Actual negative	FP	TN
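As a worked illustration of how the four confusion-matrix counts feed the evaluation metrics, the sketch below computes recall, specificity, precision, F1 and G-mean using the standard binary definitions. How the paper extends F1 and G-mean to the multi-class case is not shown on this page, so the binary form here is an assumption, and `confusion_metrics` is a hypothetical helper name.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Standard binary metrics derived from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1, "g_mean": g_mean}


# Illustrative counts only; they are not taken from the paper.
metrics = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

G-mean drops sharply when either class is poorly recognised, which is why it is commonly reported alongside F1 for imbalanced data.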

Dataset	Samples	Attributes	Classes	IR
Abalone	4 177	8	3	2.10
Contraceptive	1 473	9	3	1.89
Dermatology	366	33	6	5.55
Glass	214	9	7	8.44
Vertebral	310	6	3	2.50
Yeast	1 484	8	10	23.15
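The IR column is the imbalance ratio. This page does not state how it is computed; a common convention for multi-class data, assumed in the sketch below, is the size of the largest class divided by the size of the smallest. The helper name `imbalance_ratio` and the toy labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class size divided by smallest class size (assumed definition of IR)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only; not one of the datasets in the table.
toy_labels = ["a"] * 50 + ["b"] * 20 + ["c"] * 10
print(imbalance_ratio(toy_labels))  # 50 / 10
```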

Dataset	Metric	Proposed method	MDO	AND-SMOTE	K-means SMOTE
Abalone	G-mean	60.81	55.32	56.84	57.92
	F1	61.10	57.94	58.46	59.54
	MAUC	77.84	73.99	74.74	76.12
Contraceptive	G-mean	69.72	57.34	62.38	63.76
	F1	69.77	58.67	63.04	64.91
	MAUC	83.55	75.83	79.50	78.99
Dermatology	G-mean	99.10	97.26	97.80	98.06
	F1	99.26	97.79	98.01	97.67
	MAUC	99.97	99.78	99.85	99.82
Glass	G-mean	91.75	86.32	91.88	85.23
	F1	94.37	87.55	89.10	84.26
	MAUC	97.97	93.82	95.05	93.25
Vertebral	G-mean	90.38	84.44	91.43	88.46
	F1	94.46	86.63	94.43	88.68
	MAUC	96.42	92.61	97.68	96.67
Yeast	G-mean	88.87	84.38	86.63	84.51
	F1	90.43	86.33	87.88	85.89
	MAUC	98.19	89.02	93.25	91.34
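MAUC is not defined on this page. The sketch below assumes it refers to the pairwise multi-class AUC of Hand and Till, which averages, over every pair of classes, the AUC obtained when each class's predicted probability is used to separate that pair. The function names `auc` and `mauc` and the toy probabilities are hypothetical.

```python
from itertools import combinations


def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def mauc(y_true, proba, classes):
    """Average pairwise AUC over all class pairs (assumed Hand and Till form)."""
    pairs = list(combinations(range(len(classes)), 2))
    total = 0.0
    for i, j in pairs:
        rows_i = [p for p, t in zip(proba, y_true) if t == classes[i]]
        rows_j = [p for p, t in zip(proba, y_true) if t == classes[j]]
        a_ij = auc([r[i] for r in rows_i], [r[i] for r in rows_j])  # class i score
        a_ji = auc([r[j] for r in rows_j], [r[j] for r in rows_i])  # class j score
        total += (a_ij + a_ji) / 2
    return total / len(pairs)


# Toy predictions for illustration only; columns follow the order of `classes`.
y_toy = [0, 0, 1, 1, 2, 2]
p_toy = [(0.7, 0.2, 0.1), (0.4, 0.5, 0.1), (0.3, 0.6, 0.1),
         (0.5, 0.3, 0.2), (0.1, 0.2, 0.7), (0.2, 0.2, 0.6)]
print(mauc(y_toy, p_toy, classes=[0, 1, 2]))
```

Each class pair contributes equally regardless of class size, so a measure of this form is not dominated by the majority class.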

Dataset	Metric	Proposed method	MDO	AND-SMOTE	K-means SMOTE
Abalone	G-mean	59.95	55.09	55.74	56.53
	F1	61.37	56.85	56.70	57.79
	MAUC	76.85	71.77	72.43	74.15
Contraceptive	G-mean	70.24	59.26	60.56	64.20
	F1	68.90	60.71	63.28	65.83
	MAUC	83.31	77.71	79.39	79.89
Dermatology	G-mean	98.89	97.41	98.73	98.72
	F1	99.26	97.05	98.74	98.43
	MAUC	99.99	99.43	99.97	99.95
Glass	G-mean	91.62	85.34	86.09	84.16
	F1	93.08	84.26	85.12	83.49
	MAUC	98.10	92.64	92.85	91.66
Vertebral	G-mean	88.19	81.39	89.83	87.02
	F1	94.46	87.78	93.34	86.75
	MAUC	96.78	91.59	97.11	96.01
Yeast	G-mean	87.66	79.85	82.84	82.76
	F1	90.01	85.84	84.79	83.67
	MAUC	98.03	88.69	91.74	90.28
