Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (5): 1287-1291.DOI: 10.11772/j.issn.1001-9081.2017.05.1287


Weighted Slope One algorithm based on clustering and Spark framework

LI Linlin1, NI Jiancheng2, YU Pingping1, YAO Binxiu1, CAO Bo1   

  1. College of Information Science and Engineering, Qufu Normal University, Rizhao Shandong 276826, China;
    2. College of Software Engineering, Qufu Normal University, Qufu Shandong 273165, China
  • Received:2016-09-30 Revised:2016-12-07 Online:2017-05-10 Published:2017-05-16
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (Youth Fund) (61402258), the Research Project of Teaching Reform in Undergraduate Colleges and Universities of Shandong Province (2015M102), and the Research Project of Teaching Reform of Qufu Normal University (jg05021*).

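  • Corresponding author: NI Jiancheng
  • About the authors: LI Linlin (born 1991), female, from Dezhou, Shandong, M.S. candidate, CCF member; research interests: parallel and distributed computing, data mining. NI Jiancheng (born 1971), male, from Jining, Shandong, professor, Ph.D., CCF member; research interests: distributed computing, machine learning, data mining. YU Pingping (born 1991), female, from Jinan, Shandong, M.S. candidate, CCF member; research interests: distributed computing, data mining. YAO Binxiu (born 1991), male, from Weifang, Shandong, M.S. candidate, CCF member; research interests: distributed computing, data mining, microblog recommendation. CAO Bo (born 1992), female, from Yichun, Heilongjiang, M.S. candidate, CCF member; research interests: parallel and distributed computing, data mining.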

Abstract: The traditional Slope One algorithm ignores the influence of item attribute information and time on item similarity, and recommendation on today's big data suffers from high computational complexity and slow processing. To address both problems, a weighted Slope One algorithm based on clustering and the Spark framework is proposed. First, a time weight is added to the traditional item rating similarity calculation, and the result is combined with item attribute similarity to produce a comprehensive item similarity. Next, the nearest-neighbor set is generated with the Canopy-K-means clustering algorithm. Finally, the data is partitioned and computed iteratively on the Spark framework to parallelize the algorithm. Experimental results show that the improved Spark-based algorithm predicts ratings more accurately than both the traditional Slope One algorithm and the weighted Slope One algorithm based on user similarity, improves running efficiency by a factor of 3.5 to 5 on average compared with the Hadoop platform, and is better suited to recommendation on large-scale datasets.

Key words: Slope One algorithm, clustering, Spark, time weight, item attribute
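
The abstract does not state its formulas, so the following is only a minimal single-machine sketch of the standard weighted Slope One predictor that the paper takes as its starting point. The `similarity` argument is a hypothetical hook standing in for the comprehensive (time-weighted rating plus item attribute) similarity; its form, the function name, and the toy data are assumptions made for illustration, and the Canopy-K-means and Spark steps are omitted.

```python
from collections import defaultdict


def weighted_slope_one(ratings, target_user, target_item, similarity=None):
    """Predict target_user's rating of target_item with weighted Slope One.

    ratings: dict mapping user -> {item: rating}.
    similarity: optional function (item_a, item_b) -> float. This is a
        hypothetical placeholder for a combined item similarity; when it
        is None, the plain co-rating count is used as the weight.
    Returns None if no co-rated item is available.
    """
    diff_sum = defaultdict(float)  # sum of (r[target_item] - r[other])
    count = defaultdict(int)       # number of users who rated both items
    for user_ratings in ratings.values():
        if target_item not in user_ratings:
            continue
        for other, r in user_ratings.items():
            if other != target_item:
                diff_sum[other] += user_ratings[target_item] - r
                count[other] += 1

    numerator = denominator = 0.0
    for other, r in ratings[target_user].items():
        n = count.get(other, 0)
        if other == target_item or n == 0:
            continue
        dev = diff_sum[other] / n  # average rating deviation
        weight = n * (similarity(target_item, other) if similarity else 1.0)
        numerator += (dev + r) * weight
        denominator += weight
    return numerator / denominator if denominator else None


# Toy data (assumed, not from the paper).
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
prediction = weighted_slope_one(ratings, "lucy", "A")  # 13/3 on this toy data
```

Passing a `similarity` function scales each item pair's weight, which is one plausible place for a comprehensive similarity to enter the prediction; the paper itself may combine the terms differently.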


CLC Number: