Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (7): 1900-1905.DOI: 10.11772/j.issn.1001-9081.2017.07.1900

Performance optimization of ItemBased recommendation algorithm based on Spark

LIAO Bin1, ZHANG Tao2,3, GUO Binglei3, YU Jiong3, ZHANG Xuguang1, LIU Yan4   

  1. College of Statistics and Information, Xinjiang University of Finance and Economics, Urumqi Xinjiang 830012, China;
    2. College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi Xinjiang 830011, China;
    3. School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830008, China;
    4. School of Software, Tsinghua University, Beijing 100084, China
  • Received:2017-01-16 Revised:2017-03-01 Online:2017-07-10 Published:2017-07-18
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61562078, 61262088) and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2016D01B014).

基于Spark的ItemBased推荐算法性能优化

廖彬1, 张陶2,3, 国冰磊3, 于炯3, 张旭光1, 刘炎4   

  1. 新疆财经大学 统计与信息学院, 乌鲁木齐 830012;
    2. 新疆医科大学 医学工程技术学院, 乌鲁木齐 830011;
    3. 新疆大学 信息科学与工程学院, 乌鲁木齐 830008;
    4. 清华大学 软件学院, 北京 100084
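  • Corresponding author: LIAO Bin (廖彬)
  • About the authors: LIAO Bin (廖彬), born 1986, male, from Neijiang, Sichuan; associate professor, Ph.D., CCF member; research interests: green computing, data mining, big data computing models. ZHANG Tao (张陶), born 1988, female, from Urumqi, Xinjiang; Ph.D. candidate; research interests: distributed computing, grid computing. GUO Binglei (国冰磊), born 1991, female, from Wuhan, Hubei; Ph.D. candidate; research interests: green computing, database systems. YU Jiong (于炯), born 1964, male, from Beijing; professor, Ph.D.; research interests: network security, grid computing, distributed computing. ZHANG Xuguang (张旭光), born 1994, male, from Zhengzhou, Henan; M.S. candidate; research interests: big data computing. LIU Yan (刘炎), born 1990, male, from Wuhan, Hubei; M.S. candidate; research interests: big data computing.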
  • 基金资助:
    国家自然科学基金资助项目(61562078,61262088);新疆维吾尔自治区自然科学基金资助项目(2016D01B014)。

Abstract: In MapReduce computing scenarios, complex data mining algorithms typically require multiple cooperating MapReduce jobs to complete a task. However, heavy redundant disk reads and writes and repeated resource requests across these jobs seriously degrade algorithm performance. To improve the computational efficiency of the ItemBased recommendation algorithm, the performance issues of the ItemBased collaborative filtering algorithm on the MapReduce platform were first analyzed. The algorithm was then implemented on the Spark platform, exploiting Spark's advantages in iterative and in-memory computation to improve execution efficiency. The experimental results show that, with clusters of 10 and 20 nodes, the running time of the algorithm on Spark is only 25.6% and 30.8%, respectively, of that on MapReduce. Overall, the algorithm runs more than 3 times as efficiently on Spark as on MapReduce.
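
For readers unfamiliar with item-based collaborative filtering, the following is a minimal single-process sketch of the general technique, not the authors' Spark implementation. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the function names `item_similarities` and `recommend` and the toy `ratings` data are hypothetical. The two steps (pairwise item similarity, then similarity-weighted prediction) are the kind of dependent stages that would run as separate jobs on MapReduce.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(ratings):
    """Cosine similarity for every ordered pair of items co-rated by a user.

    ratings: dict mapping user -> {item: rating}.
    (Cosine similarity is an assumption; the abstract names no measure.)
    """
    dot = defaultdict(float)   # (i, j) -> sum over users of r_ui * r_uj
    norm = defaultdict(float)  # i -> sum over users of r_ui ** 2
    for user_items in ratings.values():
        for i, r_i in user_items.items():
            norm[i] += r_i * r_i
            for j, r_j in user_items.items():
                if i != j:
                    dot[(i, j)] += r_i * r_j
    return {
        (i, j): d / (sqrt(norm[i]) * sqrt(norm[j]))
        for (i, j), d in dot.items()
    }


def recommend(ratings, sims, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    seen = ratings[user]
    scores = defaultdict(float)   # candidate item -> sum of sim * rating
    weights = defaultdict(float)  # candidate item -> sum of sim
    for (i, j), s in sims.items():
        if i in seen and j not in seen:
            scores[j] += s * seen[i]
            weights[j] += s
    ranked = sorted(
        ((item, scores[item] / weights[item])
         for item in scores if weights[item] > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical toy data: three users, three items.
ratings = {
    "u1": {"a": 5.0, "b": 3.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0},
    "u3": {"b": 4.0, "c": 1.0},
}
sims = item_similarities(ratings)
recs = recommend(ratings, sims, "u1")
```

Here `recs` holds a predicted rating for item `c`, the only item user `u1` has not rated. On a cluster, each step would typically be expressed as distributed transformations over the rating records; this version only illustrates the data flow between the stages.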

Key words: collaborative filtering, MapReduce, Spark algorithm, performance optimization, Directed Acyclic Graph (DAG)

摘要: MapReduce计算场景下,复杂的大数据挖掘类算法通常需要多个MapReduce作业协作完成,但多个作业之间严重的冗余磁盘读写及重复的资源申请操作,使得算法的性能严重降低。为提高ItemBased推荐算法的计算效率,首先对MapReduce平台下ItemBased协同过滤算法存在的性能问题进行了分析;在此基础上利用Spark迭代计算及内存计算上的优势提高算法的执行效率,并实现了基于Spark平台的ItemBased推荐算法。实验结果表明:当集群节点规模分别为10与20时,算法在Spark中的运行时间分别只有MapReduce中的25.6%及30.8%,Spark平台下的算法相比MapReduce平台,执行效率整体提高3倍以上。

关键词: 协同过滤, MapReduce, Spark算法, 性能优化, 有向非循环图
