Large-scale data classification based on hierarchical clustering and re-sampling

Journal of Computer Applications, 2013, Vol. 33, Issue (10): 2801-2803.

• Artificial intelligence •

ZHANG Yong,FU Panpan,ZHANG Yuting

School of Computer and Information Technology, Liaoning Normal University, Dalian Liaoning 116081, China

Received: 2013-03-13 Revised: 2013-04-24 Online: 2013-11-01 Published: 2013-10-01
Contact: ZHANG Yong

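About the authors: ZHANG Yong (born 1975), male, from Langzhong, Sichuan; associate professor, Ph.D., CCF member; main research interests: machine learning and intelligent computing. FU Panpan (born 1987), female, from Xinxiang, Henan; master's student; main research interest: machine learning. ZHANG Yuting (born 1990), female, from Harbin, Heilongjiang; master's student; main research interest: machine learning.
Supported by:
National Natural Science Foundation of China; China Postdoctoral Science Foundation; Education Department of Liaoning Province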

Abstract

This paper presents a Support Vector Machine (SVM) classification method for large-scale data that is based on hierarchical clustering and re-sampling and combines supervised with unsupervised learning. The method first uses k-means clustering to partition the dataset into several subsets. It then clusters each subset class by class and selects the samples in the neighborhood of each cluster center to form the candidate training set. Finally, an SVM is trained on the candidate training set. The experimental results show that the proposed method substantially reduces the SVM learning cost, achieves better classification accuracy than random re-sampling, and attains about the same classification accuracy as training on the full, unsampled dataset.

Key words: large-scale data, classification, clustering, re-sampling, Support Vector Machine (SVM)
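
The sample-selection stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the "neighborhood" of a cluster center is defined, so the fixed `radius` threshold, the function names `kmeans` and `select_candidates`, and all parameter defaults below are assumptions made only for illustration. The final SVM training step is left out; the returned indices would be passed to any SVM trainer.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of coordinate tuples.

    Returns (centers, assign), where assign[i] is the center index of points[i].
    """
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centers[j]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, [nearest(p) for p in points]


def select_candidates(X, y, n_subsets=2, n_clusters=2, radius=1.0, seed=0):
    """Return the sorted indices of the samples kept as candidate training data."""
    # Stage 1 (unsupervised): partition the whole dataset into subsets.
    _, subset_of = kmeans(X, n_subsets, seed=seed)

    # Stage 2: inside each subset, handle each class separately.
    groups = defaultdict(list)
    for i, s in enumerate(subset_of):
        groups[(s, y[i])].append(i)

    keep = set()
    for idx in groups.values():
        pts = [X[i] for i in idx]
        centers, assign = kmeans(pts, min(n_clusters, len(pts)), seed=seed)
        # Assumed neighborhood rule: keep samples within `radius` of their center.
        for i, p, a in zip(idx, pts, assign):
            if math.dist(p, centers[a]) <= radius:
                keep.add(i)
    return sorted(keep)


# Toy data: two classes, each a tight group plus one distant sample.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0),
     (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1), (13.0, 13.0)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = select_candidates(X, y, n_subsets=1, n_clusters=1, radius=1.0)
# The reduced set (X[i], y[i]) for i in kept would then be used to train the SVM.
```

With one subset and one cluster per class, each center is simply the class mean, so the distant sample in each class falls outside the radius and is dropped. With more clusters, an isolated sample can become its own cluster center and be kept, which is one reason the neighborhood rule and cluster counts would need tuning in practice.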

CLC Number: TP181

ZHANG Yong, FU Panpan, ZHANG Yuting. Large-scale data classification based on hierarchical clustering and re-sampling[J]. Journal of Computer Applications, 2013, 33(10): 2801-2803.
