Imbalanced data classification method based on Lasso and constructive covering algorithm

Journal of Computer Applications, 2023, Vol. 43, Issue 4: 1086-1093. DOI: 10.11772/j.issn.1001-9081.2022040490

Section: Data science and technology

Yi JIANG¹, Shuping WU¹, Kun HU², Linbo LONG¹

1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2. Cloud Computing Center of Yunnan Branch, China Telecom Corporation Limited, Kunming Yunnan 650200, China

Received: 2022-04-14  Revised: 2022-06-08  Accepted: 2022-06-13  Online: 2022-07-01  Published: 2023-04-10
Contact: Shuping WU
About the authors: JIANG Yi, born in 1969, Ph.D., senior engineer. His research interests include computer architecture, software engineering, big data and network security.
HU Kun, born in 1970, senior engineer. His research interests include cloud computing.
LONG Linbo, born in 1988, Ph.D., associate professor. His research interests include compiler optimization, new storage technologies, big data and embedded systems.
Supported by: National Natural Science Foundation of China (61902045); Chongqing Technology Innovation and Application Development Special Key Project (cstc2019jscx-mbdxX0035)

Abstract

To address the insufficient ability of machine learning classification algorithms to identify minority-class samples in imbalanced data classification, an imbalanced data classification method, L-CCSmote (Least absolute shrinkage and selection operator Constructive Covering Synthetic minority oversampling technique), was proposed, taking the telecom customer churn scenario as an example. Firstly, features related to churned customers were extracted through Lasso (Least absolute shrinkage and selection operator) to optimize the model input. Then, a neural network was built through the Constructive Covering Algorithm (CCA) to generate covers that conformed to the overall distribution of the samples. Finally, a single-sample cover strategy, a sample diversity strategy and a sample density peak strategy were proposed and combined into a hybrid sampling scheme to balance the data. A total of 13 imbalanced datasets from the KEEL database and 2 desensitized telecom customer datasets were selected, and the proposed method was verified with the Logistic Regression (LR) and Support Vector Machine (SVM) classification algorithms. With LR, the average Geometric MEAN (G-MEAN) of the proposed method was 2.32% higher than that of Synthetic Minority Oversampling TEchnique Edited nearest neighbor (SMOTE-Enn). With SVM, it was 2.44% higher than that of Borderline-SMOTE (Borderline Synthetic Minority Oversampling Technique). Experimental results show that the proposed method can mitigate the influence of skewed class distribution on classification, and that its ability to recognize rare classes is better than that of the classical data balancing methods.
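
The covering step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Euclidean-distance variant of constructive covering (the classical algorithm projects samples onto a hypersphere and works with inner products), and the function name `build_covers`, the radius rule (midpoint between the nearest different-class distance and the farthest same-class distance inside it) and the toy data are all illustrative assumptions. The three sampling strategies are not reproduced, since the abstract does not specify them.

```python
import math
import random


def build_covers(points, labels, seed=0):
    """Greedy constructive covering, distance-based variant (illustrative only).

    Returns a list of (center_index, radius, label, member_indices) tuples.
    Each cover contains samples of a single class.
    """
    rng = random.Random(seed)
    uncovered = set(range(len(points)))
    covers = []
    while uncovered:
        # Pick a random sample that no cover contains yet as the new center.
        c = rng.choice(sorted(uncovered))
        dist = [math.dist(points[c], p) for p in points]
        # d1: distance to the nearest sample of a different class.
        d1 = min((dist[j] for j in range(len(points)) if labels[j] != labels[c]),
                 default=math.inf)
        # d2: distance to the farthest same-class sample that is closer than d1.
        d2 = max((dist[j] for j in range(len(points))
                  if labels[j] == labels[c] and dist[j] < d1),
                 default=0.0)
        # Assumed radius rule: midpoint of d1 and d2 (d2 if no other class exists).
        radius = d2 if math.isinf(d1) else (d1 + d2) / 2
        members = [j for j in sorted(uncovered)
                   if labels[j] == labels[c] and dist[j] <= radius]
        covers.append((c, radius, labels[c], members))
        uncovered -= set(members)
    return covers


# Toy data (hypothetical): label 1 is the minority class.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
toy_labels = [1, 1, 0, 0, 0]
for center, radius, label, members in build_covers(toy_points, toy_labels):
    print(f"class {label}: center {center}, radius {radius:.2f}, members {members}")
```

In this sketch every sample ends up in exactly one cover, and the size of each cover (a single member versus many) is the kind of information a cover-based sampling strategy could act on.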

Key words: Lasso (Least absolute shrinkage and selection operator), constructive covering algorithm, imbalanced data classification, customer churn prediction, hybrid sampling

CLC Number: TP301.6

Yi JIANG, Shuping WU, Kun HU, Linbo LONG. Imbalanced data classification method based on Lasso and constructive covering algorithm[J]. Journal of Computer Applications, 2023, 43(4): 1086-1093.

Dataset	Abbreviation	Samples	Attributes	Imbalance ratio
Glass1	G1	214	9	1.82
Pima	P1	768	8	1.87
Vehicle3	V3	846	18	2.99
Segment0	S0	2308	19	6.02
Glass6	G6	214	9	6.38
Yeast3	Y3	1484	8	8.10
Ecoli3	E3	336	7	8.60
Vowel0	V0	988	13	9.98
Cleveland-0_vs_4	C0	177	13	12.62
Ecoli-0-1-4-6_vs_5	E0	280	6	13.00
Abalone9-18	A9	731	8	16.40
Yeast5	Y5	1484	8	32.73
Abalone-17_vs_7-8-9-10	A17	2338	8	39.31
Data1	D1	10000	13	20.30
Data2	D2	10900	60	49.05

True label	Predicted positive	Predicted negative
Positive	TP (True Positive)	FN (False Negative)
Negative	FP (False Positive)	TN (True Negative)
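
As a hedged illustration of how the confusion-matrix entries feed the reported metrics, the sketch below computes F1 and G-MEAN from TP, FN, FP and TN. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, G-MEAN as the geometric mean of the true positive rate and the true negative rate); the function name `binary_metrics` and the example counts are hypothetical and not taken from the paper.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Return precision, recall, specificity, F1 and G-MEAN for one confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical counts: 10 positive (minority) and 90 negative samples.
scores = binary_metrics(tp=8, fn=2, fp=4, tn=86)
print({name: round(value, 4) for name, value in scores.items()})
```

Because G-MEAN multiplies the two per-class rates, a classifier that ignores the minority class scores 0 on it no matter how high its overall accuracy is, which is why it suits imbalanced data.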

F1 per dataset and classifier:
Dataset	L-CCSmote		S-Enn		S-Tomek		B1-S		Adasyn		OSS
	LR	SVM	LR	SVM	LR	SVM	LR	SVM	LR	SVM	LR	SVM
AR	1.53	2.13	3.40	3.13	2.60	2.60	3.27	2.53	4.13	2.87	2.73	3.27
G1	0.6275	0.6364	0.5714	0.5909	0.5833	0.6222	0.5079	0.6383	0.5397	0.5778	0.6250	0.6667
P1	0.6714	0.6711	0.6296	0.6506	0.6434	0.6621	0.6575	0.6708	0.6573	0.6538	0.6154	0.6331
V3	0.7194	0.6667	0.6536	0.6405	0.6963	0.6573	0.6809	0.6531	0.7042	0.6575	0.6604	0.5161
S0	0.9816	0.9816	0.9816	0.9753	0.9816	0.9816	0.9759	0.9753	0.9759	0.9756	0.9753	0.9816
G6	0.8750	0.9231	0.7778	0.8571	0.7500	0.9231	0.7500	0.7500	0.7059	0.8571	0.7500	0.9231
Y3	0.8000	0.7027	0.6957	0.7129	0.6972	0.6990	0.6838	0.7048	0.6723	0.8267	0.7692	0.7525
E3	0.7826	0.6667	0.6667	0.6923	0.7826	0.6364	0.7826	0.6087	0.6000	0.6957	0.5714	0.4000
V0	0.8936	0.9362	0.8085	0.9362	0.8085	0.9362	0.7917	0.9778	0.7826	0.9362	0.8261	0.9268
C0	0.7500	0.8571	0.6000	0.6667	0.6000	0.6667	0.7500	0.6667	0.5455	0.8571	0.6667	0.5000
E0	0.8000	0.7273	0.6154	0.6667	0.6667	0.8000	0.7273	0.8889	0.5714	0.6667	0.8000	0.8889
A9	0.7692	0.6452	0.5556	0.3200	0.7692	0.5714	0.7200	0.6452	0.6452	0.5000	0.7059	0.2857
Y5	0.5238	0.5806	0.5946	0.6000	0.5500	0.6000	0.5366	0.6000	0.5366	0.4286	0.6667	0.5806
A17	0.3810	0.3056	0.3077	0.2895	0.3333	0.2899	0.3492	0.3284	0.3200	0.2740	0.3333	0.1250
D1	0.4813	0.5484	0.4229	0.4522	0.4340	0.5074	0.3457	0.4286	0.2900	0.3734	0.7492	0.8235
D2	0.8947	0.9381	0.9533	0.9725	0.9273	0.9541	0.5854	0.9533	0.6082	0.9725	0.9231	0.7727

AUC per dataset and classifier:
Dataset	L-CCSmote		S-Enn		S-Tomek		B1-S		Adasyn		OSS
	LR	SVM	LR	SVM	LR	SVM	LR	SVM	LR	SVM	LR	SVM
AR	1.47	1.67	2.87	2.93	2.60	2.60	3.33	2.47	3.40	3.27	4.13	3.40
G1	0.6925	0.7113	0.6398	0.6707	0.6541	0.6959	0.5211	0.7044	0.5617	0.6558	0.7203	0.7358
P1	0.7467	0.7446	0.7046	0.7230	0.7233	0.7382	0.7342	0.7430	0.7347	0.7286	0.7065	0.7164
V3	0.8585	0.8176	0.8145	0.8019	0.8333	0.8082	0.8270	0.8082	0.8491	0.8113	0.7736	0.6761
S0	0.9868	0.9868	0.9868	0.9807	0.9868	0.9868	0.9909	0.9807	0.9909	0.9858	0.9807	0.9868
G6	0.9787	0.9286	0.9574	0.9179	0.8967	0.9286	0.8967	0.8967	0.8860	0.9179	0.8967	0.9286
Y3	0.9392	0.9286	0.9363	0.9027	0.9180	0.8996	0.9333	0.9103	0.9302	0.8735	0.8552	0.9301
E3	0.9667	0.8978	0.9400	0.9467	0.9667	0.8489	0.9667	0.8442	0.9200	0.9044	0.7156	0.6467
V0	0.9684	0.9933	0.9185	0.9933	0.9185	0.9933	0.9163	0.9978	0.8958	0.9933	0.9207	0.9318
C0	0.9756	0.9878	0.9512	0.8211	0.9512	0.8211	0.9756	0.8211	0.9390	0.9878	0.8211	0.6667
E0	0.8923	0.8846	0.8692	0.8769	0.8769	0.8923	0.8846	0.9000	0.8615	0.8769	0.8923	0.9000
A9	0.9400	0.9255	0.9109	0.7735	0.9400	0.9138	0.8946	0.9255	0.9255	0.8993	0.7727	0.5880
Y5	0.9722	0.8938	0.9792	0.8952	0.9750	0.8952	0.9736	0.8952	0.9736	0.6364	0.7713	0.8938
A17	0.8684	0.8263	0.8553	0.8228	0.8605	0.7947	0.8342	0.8307	0.8579	0.7912	0.6000	0.5333
D1	0.8976	0.9038	0.8945	0.8943	0.8937	0.9011	0.8850	0.8925	0.8616	0.8739	0.8466	0.8862
D2	0.9705	0.9896	0.9718	0.9904	0.9713	0.9809	0.9328	0.9718	0.9693	0.9904	0.9441	0.8148

Algorithm	F1		AUC		G-MEAN
	LR	SVM	LR	SVM	LR	SVM
S-Enn	0.0035	0.0413	0.0023	0.0186	0.0023	0.0186
S-Tomek	0.0047	0.2094	0.0037	0.0076	0.0024	0.0076
B1-S	0.0037	0.7299	0.0037	0.0258	0.0023	0.0219
Adasyn	0.0009	0.0747	0.0012	0.0046	0.0012	0.0037
OSS	0.0800	0.1823	0.0015	0.0121	0.0015	0.0121

Algorithm	F1 (%)		AUC (%)		G-MEAN (%)
	LR	SVM	LR	SVM	LR	SVM
S-Enn	65.56	68.62	88.87	86.74	88.66	86.43
S-Tomek	68.16	70.05	89.11	87.32	88.84	87.03
B1-S	65.63	69.93	87.78	87.48	86.73	87.19
Adasyn	61.03	68.35	87.71	86.18	86.86	85.22
OSS	70.92	65.18	81.45	78.90	79.99	74.30
L-CCSmote	73.01	71.91	91.03	89.47	90.72	89.32
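
The G-MEAN gains quoted in the abstract (2.32% over SMOTE-Enn with LR, 2.44% over Borderline-SMOTE with SVM) appear to be relative improvements over the baseline average rather than percentage-point differences. The short check below recomputes them from the G-MEAN columns of the table above; the helper name `relative_gain_percent` is an illustrative assumption.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, expressed in percent."""
    return (new - baseline) / baseline * 100.0


# Average G-MEAN values copied from the table above.
lr_gain = relative_gain_percent(new=90.72, baseline=88.66)    # L-CCSmote vs S-Enn, LR
svm_gain = relative_gain_percent(new=89.32, baseline=87.19)   # L-CCSmote vs B1-S, SVM

print(f"LR:  {lr_gain:.2f}%")   # prints "LR:  2.32%"
print(f"SVM: {svm_gain:.2f}%")  # prints "SVM: 2.44%"
```

For comparison, the absolute differences are 2.06 and 2.13 percentage points, so the two readings should not be confused when citing the result.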