Journal of Computer Applications, 2023, Vol. 43, Issue (4): 1086-1093. DOI: 10.11772/j.issn.1001-9081.2022040490

• Data science and technology

Imbalanced data classification method based on Lasso and constructive covering algorithm

Yi JIANG1, Shuping WU1, Kun HU2, Linbo LONG1

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Cloud Computing Center of Yunnan Branch, China Telecom Corporation Limited, Kunming Yunnan 650200, China
  • Received: 2022-04-14 Revised: 2022-06-08 Accepted: 2022-06-13 Online: 2022-07-01 Published: 2023-04-10
  • Contact: Shuping WU
  • About author: JIANG Yi, born in 1969, Ph. D., senior engineer. His research interests include computer architecture, software engineering, big data, network security.
    HU Kun, born in 1970, senior engineer. His research interests include cloud computing.
    LONG Linbo, born in 1988, Ph. D., associate professor. His research interests include compiler optimization, new storage technologies, big data, embedded systems.
  • Supported by:
    National Natural Science Foundation of China (61902045); Chongqing Technology Innovation and Application Development Special Key Project (cstc2019jscx-mbdxX0035)

基于Lasso和构造性覆盖算法的不均衡数据分类方法

蒋溢1, 伍书平1, 胡昆2, 龙林波1

  1. 重庆邮电大学 计算机科学与技术学院,重庆 400065
    2.中国电信股份有限公司 云南分公司云计算中心,昆明 650200
  • 通讯作者: 伍书平
  • 作者简介:蒋溢(1969—),男,湖北安陆人,正高级工程师,博士,CCF会员,主要研究方向:计算机体系结构、软件工程、大数据、网络安全;
    胡昆(1970—),男,云南昆明人,高级工程师,主要研究方向:云计算;
    龙林波(1988—),男,重庆人,副教授,博士,主要研究方向:编译器优化、新型存储技术、大数据、嵌入式系统。
  • 基金资助:
    国家自然科学基金资助项目(61902045);重庆市技术创新与应用发展专项重点项目(cstc2019jscx-mbdxX0035)

Abstract:

To address the insufficient ability of machine learning classification algorithms to identify minority-class samples in imbalanced data classification, an imbalanced data classification method named L-CCSmote (Least absolute shrinkage and selection operator Constructive Covering Synthetic minority oversampling technique) was proposed, taking the telecom customer churn scenario as an example. Firstly, features related to churned customers were extracted through Lasso (Least absolute shrinkage and selection operator) to optimize the model input. Then, a neural network was built through the Constructive Covering Algorithm (CCA) to generate coverages that conformed to the overall distribution of the samples. Finally, a single-sample coverage strategy, a sample diversity strategy and a sample density peak strategy were further proposed, and hybrid sampling based on these strategies was performed to balance the data. A total of 13 imbalanced datasets from the KEEL database and 2 desensitized telecom customer datasets were selected, and the proposed method was verified on the Logistic Regression (LR) and Support Vector Machine (SVM) classification algorithms respectively. On the LR classification algorithm, compared with Synthetic Minority Oversampling TEchnique Edited nearest neighbor (SMOTE-Enn), the proposed method increased the average Geometric MEAN (G-MEAN) by 2.32%. On the SVM classification algorithm, compared with Borderline-SMOTE (Borderline Synthetic Minority Oversampling TEchnique), the proposed method increased the average G-MEAN by 2.44%. Experimental results show that the proposed method can address the effect of skewed class distribution on classification, and that its ability to recognize rare classes is better than that of the classical data balancing methods.
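
The abstract reports its gains in G-MEAN without defining it. As a minimal sketch, assuming the usual binary-classification definition (the square root of the product of minority-class recall and majority-class recall), the metric can be computed as below. The function name `g_mean` and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of minority-class recall (sensitivity) and
    majority-class recall (specificity) for binary labels.

    Assumption: `positive` marks the minority class (e.g. churned customers).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 2 churned customers among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                    # never predicts churn
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]      # one churner found, one false alarm

score_majority = g_mean(y_true, always_majority)
score_one_hit = g_mean(y_true, one_hit)
```

In this toy case, a classifier that always predicts the majority class is 80% accurate but scores a G-MEAN of zero, because its minority-class recall is zero. This is why the metric suits imbalanced data better than plain accuracy.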

Key words: Lasso (Least absolute shrinkage and selection operator), constructive covering algorithm, imbalanced data classification, customer churn prediction, hybrid sampling

摘要:

针对机器学习分类算法在不均衡数据分类问题中对少数类样本识别能力不足的问题,以电信客户流失场景为例,提出一种不均衡数据分类方法L-CCSmote(Lasso Constructive Covering Smote)。首先,通过套索回归(Lasso)提取流失用户特征以优化模型输入;然后,通过构造性覆盖算法(CCA)建立神经网络生成符合样本整体分布的覆盖;最后,进一步提出单样本覆盖策略、样本多样性策略和样本密度峰值策略,通过以上策略混合采样以平衡数据。选用了KEEL数据库中的13个不均衡数据集和2个脱敏电信客户数据集,分别在逻辑回归(LR)和支持向量机(SVM)分类算法上对该方法进行验证。在LR分类算法上,与SMOTE-Enn(Synthetic Minority Oversampling TEchnique Edited nearest neighbor)相比,所提方法的平均几何平均值(G-MEAN)提升了2.32%;在SVM分类算法上,与Borderline-SMOTE(Borderline Synthetic Minority Oversampling Technique Edited)相比,所提方法的平均G-MEAN提升了2.44%。实验结果表明,所提方法能解决类别偏斜分布影响分类的问题,且对于稀有类的识别能力优于经典平衡数据方法。

关键词: Lasso, 构造性覆盖算法, 不均衡数据分类, 客户流失预测, 混合采样

CLC Number: