Classification algorithm for imbalance dataset based on quotient space theory

Journal of Computer Applications, 2012, Vol. 32, Issue (01): 210-212. DOI: 10.3724/SP.J.1087.2012.00210

ZHANG Jian, FANG Hong-bin, SUN Qi-lin, LIU Mingshu

School of Mathematical Sciences, Anhui University, Hefei, Anhui 230039, China

Received: 2011-07-15 Revised: 2011-09-21 Online: 2012-02-06 Published: 2012-01-01
Corresponding author: ZHANG Jian
About the authors: ZHANG Jian (1981-), male, born in Anqing, Anhui, master's student; research interests: machine learning, pattern recognition. FANG Hong-bin (1972-), male, born in Chizhou, Anhui, associate professor, Ph.D.; research interests: intelligent computing, information fusion. SUN Qi-lin (1982-), male, born in Hefei, Anhui, master's student; research interests: financial data mining.
Funding: National Natural Science Foundation of China (71071002); Natural Science Foundation of the Education Department of Anhui Province (05010428); Anhui University Talent Team Construction Project; Anhui University Academic Innovation Team Project (KJTD001B)

Abstract: Imbalanced datasets are common in machine learning classification problems. To improve classification performance on imbalanced datasets, an over-sampling classification algorithm based on quotient space theory, QMSVM, is proposed. The majority-class samples in the training set are partitioned according to their clustering structure, and the resulting granules are merged with the minority-class samples to train a linear Support Vector Machine (SVM), which yields the support vectors and the misclassified granules of the majority class. The support vectors and misclassified samples of the minority class are obtained as well and over-sampled with the Synthetic Minority Over-sampling Technique (SMOTE). Finally, the two resulting sets of samples are merged for SVM learning, which rebalances the training set and gives a more reasonable separating hyperplane. Experimental results show that, compared with several other algorithms, the proposed algorithm has a somewhat lower overall classification accuracy but considerably improves the g_means value and acc+ (the classification accuracy on positive samples), and that it performs better on datasets with a larger imbalance ratio.

Key words: imbalanced dataset, quotient space theory, Support Vector Machine (SVM), over-sampling, QMSVM algorithm
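
Two pieces of the abstract can be illustrated with a short sketch: the SMOTE interpolation applied to the selected minority samples, and the two reported metrics. This is not the authors' implementation. The function names `smote_interpolate` and `acc_plus_and_g_mean` are hypothetical, the clustering and SVM steps of QMSVM are omitted, and the metric definitions are assumed to be the usual ones (acc+ as the fraction of positive samples classified correctly, g_means as the geometric mean of acc+ and the corresponding accuracy on negatives), since the abstract does not define them.

```python
import math
import random


def smote_interpolate(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    sample and one of its k nearest neighbours within `minority`.
    In QMSVM, `minority` would be the minority-class support vectors and
    misclassified samples (assumption: they are passed in as tuples).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other samples, sorted by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def acc_plus_and_g_mean(y_true, y_pred, positive=1):
    """Return (acc+, g_means) under the assumed standard definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    acc_pos = tp / (tp + fn)  # accuracy on the positive (minority) class
    acc_neg = tn / (tn + fp)  # accuracy on the negative (majority) class
    return acc_pos, math.sqrt(acc_pos * acc_neg)
```

Under these definitions, a classifier that assigns every sample to the majority class can still reach a high overall accuracy on a strongly imbalanced test set while its acc+ and g_means are both zero, which is why a method can lose some accuracy and still improve on the two metrics reported above.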

CLC number: TP311.13

ZHANG Jian, FANG Hong-bin, SUN Qi-lin, LIU Mingshu. Classification algorithm for imbalance dataset based on quotient space theory[J]. Journal of Computer Applications, 2012, 32(01): 210-212.
