Highly efficient Chinese text classification algorithm of KNN based on Spark framework

Journal of Computer Applications, 2016, Vol. 36, Issue (12): 3292-3297. DOI: 10.11772/j.issn.1001-9081.2016.12.3292

YU Pingping (于苹苹)¹, NI Jiancheng (倪建成)², YAO Binxiu (姚彬修)¹, LI Linlin (李淋淋)¹, CAO Bo (曹博)¹

1. School of Information Science and Engineering, Qufu Normal University, Rizhao Shandong 276826, China;
2. School of Software Engineering, Qufu Normal University, Qufu Shandong 273100, China

Received: 2016-06-30 Revised: 2016-08-30 Online: 2016-12-10 Published: 2016-12-08
Corresponding author: NI Jiancheng
About the authors: YU Pingping, born in 1991, female, from Jinan, Shandong, M.S. candidate, CCF member. Her research interests include parallel and distributed computing and data mining. NI Jiancheng, born in 1971, male, from Jining, Shandong, Ph.D., professor, CCF senior member. His research interests include distributed computing, machine learning and data mining. YAO Binxiu, born in 1991, male, from Weifang, Shandong, M.S. candidate, CCF member. His research interests include distributed computing, data mining and microblog recommendation. LI Linlin, born in 1991, female, from Dezhou, Shandong, M.S. candidate, CCF member. Her research interests include distributed computing and data mining. CAO Bo, born in 1992, female, from Yichun, Heilongjiang, M.S. candidate, CCF member. Her research interests include parallel and distributed computing and data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61402258), the Research Project of Teaching Reform in Undergraduate Colleges and Universities of Shandong Province (2015M102), and the University-level Research Project of Teaching Reform (jg05021*).

Abstract

Abstract: The time complexity of the K-Nearest Neighbor (KNN) classification algorithm is proportional to the number of training samples, which makes it computationally expensive, and traditional architectures process data too slowly for current big-data workloads. To address both problems, a highly efficient KNN classification algorithm based on the Spark framework and clustering optimization was proposed. First, the training set was pruned twice by an optimized K-medoids clustering algorithm that introduces a constriction factor. Then, during classification, the value of K was iterated to obtain the classification result, and the data was partitioned and processed iteratively on the Spark framework to parallelize the computation. Experimental results show that, on different datasets, the traditional KNN algorithm and the K-medoids-based KNN algorithm take 3.92 to 31.90 times as long as the proposed Spark-based KNN algorithm. The proposed algorithm has high computational efficiency and a better speedup ratio than KNN on the Hadoop platform, and it can effectively classify big data.

Key words: K-Nearest Neighbor (KNN), clustering, constriction factor, K-medoids, Spark, parallel computing
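
The partition-and-merge idea behind parallel KNN mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists in place of Spark partitions, assumes Euclidean distance over numeric feature vectors, and leaves out both the K-medoids pruning step and the iteration over K, since the abstract does not specify them. The function names (`local_top_k`, `knn_classify`) and the toy data are hypothetical.

```python
import heapq
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (an assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_top_k(partition, query, k):
    """Return the k nearest (distance, label) pairs found inside one partition."""
    scored = ((euclidean(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def knn_classify(partitions, query, k):
    """Merge per-partition candidates, then majority-vote over the global k nearest."""
    candidates = []
    for partition in partitions:  # on a cluster, partitions would be scanned in parallel
        candidates.extend(local_top_k(partition, query, k))
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature vector, class label), split into two partitions.
partitions = [
    [((0.0, 0.0), "sports"), ((0.1, 0.2), "sports"), ((5.0, 5.0), "finance")],
    [((0.2, 0.1), "sports"), ((5.1, 4.9), "finance"), ((4.8, 5.2), "finance")],
]

print(knn_classify(partitions, (0.1, 0.1), k=3))
print(knn_classify(partitions, (5.0, 5.1), k=3))
```

Keeping only the k best candidates per partition is safe because every one of the global k nearest neighbors must also be among the k nearest within its own partition, so the merge step only has to compare at most k candidates per partition.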

CLC number: TP391.1

YU Pingping, NI Jiancheng, YAO Binxiu, LI Linlin, CAO Bo. Highly efficient Chinese text classification algorithm of KNN based on Spark framework[J]. Journal of Computer Applications, 2016, 36(12): 3292-3297.

