计算机应用 ›› 2016, Vol. 36 ›› Issue (12): 3292-3297.DOI: 10.11772/j.issn.1001-9081.2016.12.3292

• 先进计算 • 上一篇    下一篇

基于Spark框架的高效KNN中文文本分类算法

于苹苹1, 倪建成2, 姚彬修1, 李淋淋1, 曹博1   

  1. 1. 曲阜师范大学 信息科学与工程学院, 山东 日照 276826;
    2. 曲阜师范大学 软件学院, 山东 曲阜 273100
  • 收稿日期:2016-06-30 修回日期:2016-08-30 出版日期:2016-12-10 发布日期:2016-12-08
  • 通讯作者: 倪建成
  • 作者简介:于苹苹(1991-),女,山东济南人,硕士研究生,CCF会员,主要研究方向:并行与分布式计算、数据挖掘;倪建成(1971-),男,山东济宁人,教授,博士,CCF高级会员,主要研究方向:分布式计算、机器学习、数据挖掘;姚彬修(1991-),男,山东潍坊人,硕士研究生,CCF会员,主要研究方向:分布式计算、数据挖掘、微博推荐;李淋淋(1991-),女,山东德州人,硕士研究生,CCF会员,主要研究方向:分布式计算、数据挖掘;曹博(1992-),女,黑龙江伊春人,硕士研究生,CCF会员,主要研究方向:并行与分布式计算、数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(61402258);山东省本科高校教学改革研究项目(2015M102);校级教学改革研究项目(jg05021*)。

Highly efficient Chinese text classification algorithm of KNN based on Spark framework

YU Pingping1, NI Jiancheng2, YAO Binxiu1, LI Linlin1, CAO Bo1   

  1. 1. School of Information Science and Engineering, Qufu Normal University, Rizhao Shandong 276826, China;
    2. School of Software Engineering, Qufu Normal University, Qufu Shandong 273100, China
  • Received:2016-06-30 Revised:2016-08-30 Online:2016-12-10 Published:2016-12-08
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61402258), the Research Project of Teaching Reform in Undergraduate Colleges and Universities of Shandong Province (2015M102), the Research Project of Teaching Reform of Universities (jg05021*).

摘要: 针对K-最近邻(KNN)分类算法时间复杂度与训练样本数量成正比而导致的计算量大的问题以及当前大数据背景下面临的传统架构处理速度慢的问题,提出了一种基于Spark框架与聚类优化的高效KNN分类算法。该算法首先利用引入收缩因子的优化K-medoids聚类算法对训练集进行两次裁剪;然后在分类过程中迭代K值获得分类结果,并在计算过程中结合Spark计算框架对数据进行分区迭代实现并行化。实验结果表明,在不同数据集中传统K-最近邻算法、基于K-medoids的K-最近邻算法所耗费时间是所提Spark框架下的K-最近邻算法的3.92~31.90倍,所提算法具有较高的计算效率,相较于Hadoop平台有较好的加速比,可有效地对大数据进行分类处理。

关键词: K-最近邻, 聚类, 收缩因子, K-medoids, Spark, 并行化计算

Abstract: The time complexity of K-Nearest Neighbor(KNN) classification algorithm is proportional to the number of training samples, which needs a large number of computation, and the bottleneck of slow processing exists in traditional architecture under the big data background. In order to solve the problems, a highly efficient algorithm of KNN based on Spark framework and clustering was proposed. Firstly, the training set was cut twice by the optimized K-medoids algorithm through introducing constriction factor. Then the K was iterated constantly in the process of classification and the classification result was obtained. And the data was partitioned and iterated to realize parallelization combining the Spark framework in the calculation. The experimental results show that, the classification time of the traditional KNN algorithm and the KNN algorithm based on K-medoids is 3.92-31.90 times of the proposed algorithm in different datasets. The proposed algorithm has high computational efficiency and better speedup ratio than KNN based on Hadoop platform, and it can effectively classify the big data.

Key words: K-Nearest Neighbor(KNN), clustering, constriction factor, K-medoids, Spark, parallel computing

中图分类号: