Large-scale data classification based on hierarchical clustering and re-sampling

Journal of Computer Applications, 2013, Vol. 33, Issue 10: 2801-2803.

ZHANG Yong, FU Panpan, ZHANG Yuting

School of Computer and Information Technology, Liaoning Normal University, Dalian Liaoning 116081, China

Received: 2013-03-13 Revised: 2013-04-24 Published: 2013-10-01 Online: 2013-11-01
Corresponding author: ZHANG Yong
About the authors: ZHANG Yong (born 1975), male, from Langzhong, Sichuan, associate professor, Ph.D., CCF member; main research interests: machine learning and intelligent computing. FU Panpan (born 1987), female, from Xinxiang, Henan, master's student; main research interest: machine learning. ZHANG Yuting (born 1990), female, from Harbin, Heilongjiang, master's student; main research interest: machine learning.
Funding:
Supported by the National Natural Science Foundation of China; the China Postdoctoral Science Foundation; a fund of the Education Department of Liaoning Province

Abstract

Abstract: To address the classification of large-scale data, this paper combines supervised and unsupervised learning and proposes a Support Vector Machine (SVM) classification method based on hierarchical clustering and re-sampling. The method first uses k-means clustering to partition the dataset into several subsets. It then clusters each subset class by class and selects the samples in the neighborhood of each cluster center to form the final training set. Finally, an SVM is trained on these selected, most representative samples. Experimental results show that the proposed method substantially reduces the learning cost of the SVM. Its classification accuracy is better than that of random under-sampling and is comparable to that obtained by training on the complete dataset.

Key words: large-scale data, classification, clustering, re-sampling, Support Vector Machine (SVM)
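
The sample-selection stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `n_subsets`, `clusters_per_class`, and `keep_ratio`, and the definition of a "neighborhood" as the nearest fraction of each cluster's members are all assumptions made for illustration. SVM training is left out.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of points; returns (centers, labels)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centers = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Assignment step: index of the nearest center (Euclidean distance).
        return [min(range(k), key=lambda c: math.dist(p, centers[c]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign()


def select_representatives(X, y, n_subsets=2, clusters_per_class=2,
                           keep_ratio=0.5, seed=0):
    """Return sorted indices of the samples kept for training.

    Assumption: the "neighborhood" of a cluster center is taken to be the
    nearest `keep_ratio` fraction of that cluster's members.
    """
    # Level 1: partition the whole dataset into subsets, ignoring labels.
    _, subset_of = kmeans(X, n_subsets, seed=seed)

    # Group sample indices by (subset, class label).
    groups = defaultdict(list)
    for i, (s, label) in enumerate(zip(subset_of, y)):
        groups[(s, label)].append(i)

    kept = []
    for idx in groups.values():
        # Level 2: cluster one class inside one subset.
        pts = [X[i] for i in idx]
        centers, assign = kmeans(pts, clusters_per_class, seed=seed)
        for c, center in enumerate(centers):
            members = [i for i, a in zip(idx, assign) if a == c]
            if not members:
                continue
            # Keep the members closest to the cluster center.
            members.sort(key=lambda i: math.dist(X[i], center))
            n_keep = max(1, math.ceil(keep_ratio * len(members)))
            kept.extend(members[:n_keep])
    return sorted(kept)


if __name__ == "__main__":
    rng = random.Random(1)
    X = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
         + [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(100)])
    y = [0] * 100 + [1] * 100
    kept = select_representatives(X, y)
    print(f"kept {len(kept)} of {len(X)} samples")
```

The reduced set `[X[i] for i in kept]` with labels `[y[i] for i in kept]` could then be passed to any SVM implementation, for example scikit-learn's `sklearn.svm.SVC`, in place of the full dataset.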

CLC number: TP181

ZHANG Yong, FU Panpan, ZHANG Yuting. Large-scale data classification based on hierarchical clustering and re-sampling[J]. Journal of Computer Applications, 2013, 33(10): 2801-2803.
