Imbalanced telecom customer data classification method based on dissimilarity

doi:10.11772/j.issn.1001-9081.2017.04.1032

Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1032-1037.

WANG Lin, GUO Nana

College of Automation and Information Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China

Received: 2016-09-05 Revised: 2016-12-26 Online: 2017-04-10 Published: 2017-04-19
Corresponding author: GUO Nana
About the authors: WANG Lin (born 1962), male, from Dongtai, Jiangsu, professor, Ph.D.; main research interests: wireless sensor networks, community detection in complex networks, big data, data mining. GUO Nana (born 1992), female, from Sanmenxia, Henan, master's student; main research interests: big data, data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61405157).

Abstract

Abstract: Conventional classification techniques have difficulty identifying churned customers in imbalanced telecom customer datasets. To address this, an Improved Dissimilarity-Based imbalanced data Classification (IDBC) algorithm was proposed, which introduces an improved prototype selection strategy into the Dissimilarity-Based Classification (DBC) algorithm. In the prototype selection stage, an improved sample subset optimization method was used to select the most informative prototype set from the whole dataset, thus avoiding the uncertainty caused by random selection. In the classification stage, feature spaces were constructed from the dissimilarities between training-set and prototype-set samples and between test-set and prototype-set samples, and conventional classification algorithms were then trained on the dissimilarity datasets mapped into the corresponding feature spaces. Finally, a telecom customer dataset and six other ordinary imbalanced datasets from the UCI database were used to evaluate the algorithm. Compared with traditional feature-based imbalanced data classification algorithms, the DBC algorithm improved the recognition rate of the rare class by 8.3% on average, and the IDBC algorithm improved it by 11.3% on average. The experimental results show that the proposed IDBC algorithm is not affected by the class distribution, and that its ability to identify the rare class in imbalanced datasets exceeds that of existing state-of-the-art classification techniques.

Key words: customer churn prediction, imbalanced data classification, Sample Subset Optimization (SSO), prototype selection, dissimilarity transformation
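
The classification stage described in the abstract (mapping samples into a feature space built from their dissimilarities to a prototype set, then training a conventional classifier on the mapped data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: Euclidean distance stands in for the dissimilarity measure, the prototypes are picked by hand rather than by the improved sample subset optimization step (which is not implemented here), a nearest-centroid rule stands in for the "conventional" classifier, and the toy data and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance, used here as an assumed dissimilarity measure."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def to_dissimilarity_space(samples, prototypes):
    """Represent each sample by its dissimilarities to every prototype."""
    return [[euclidean(x, p) for p in prototypes] for x in samples]


def fit_centroids(vectors, labels):
    """Stand-in conventional classifier: one mean vector per class."""
    sums, counts = {}, {}
    for v, y in zip(vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(v))
        for i, value in enumerate(v):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, vectors):
    """Assign each vector to the class with the nearest centroid."""
    return [min(centroids, key=lambda y: euclidean(v, centroids[y]))
            for v in vectors]


def rare_class_recall(y_true, y_pred, rare=1):
    """Fraction of rare-class samples that were predicted as rare."""
    total = sum(1 for t in y_true if t == rare)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == rare and p == rare)
    return hits / total if total else 0.0


# Toy imbalanced data (hypothetical): class 0 is the majority, class 1 is rare.
train_X = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.4, 0.4),
           (0.2, 0.1), (0.6, 0.5), (5.0, 5.0), (5.3, 4.8)]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]
test_X = [(0.3, 0.3), (4.9, 5.2), (0.7, 0.1), (5.1, 4.7)]
test_y = [0, 1, 0, 1]

# Hand-picked prototypes; IDBC would select these with its improved
# sample subset optimization step, which this sketch does not reproduce.
prototypes = [train_X[0], train_X[3], train_X[6], train_X[7]]

# Both training and test samples are mapped against the same prototype set.
train_D = to_dissimilarity_space(train_X, prototypes)
test_D = to_dissimilarity_space(test_X, prototypes)

centroids = fit_centroids(train_D, train_y)
predictions = predict(centroids, test_D)
recall = rare_class_recall(test_y, predictions)
```

Each mapped sample has one coordinate per prototype, so the size and composition of the prototype set directly determine the new feature space. That is why the abstract treats prototype selection as the step worth improving over random choice.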

CLC number: TP301.6

WANG Lin, GUO Nana. Imbalanced telecom customer data classification method based on dissimilarity[J]. Journal of Computer Applications, 2017, 37(4): 1032-1037.

