Performance optimization of ItemBased recommendation algorithm based on Spark

doi:10.11772/j.issn.1001-9081.2017.07.1900

Journal of Computer Applications, 2017, Vol. 37, Issue 7: 1900-1905. DOI: 10.11772/j.issn.1001-9081.2017.07.1900

LIAO Bin¹, ZHANG Tao²,³, GUO Binglei³, YU Jiong³, ZHANG Xuguang¹, LIU Yan⁴

1. College of Statistics and Information, Xinjiang University of Finance and Economics, Urumqi Xinjiang 830012, China;
2. College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi Xinjiang 830011, China;
3. School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830008, China;
4. School of Software, Tsinghua University, Beijing 100084, China

Received: 2017-01-16 Revised: 2017-03-01 Online: 2017-07-10 Published: 2017-07-18
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61562078, 61262088) and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2016D01B014).

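Corresponding author: LIAO Bin
About the authors: LIAO Bin (born 1986), male, from Neijiang, Sichuan, Ph.D., associate professor, CCF member; main research interests: green computing, data mining, and big data computing models. ZHANG Tao (born 1988), female, from Urumqi, Xinjiang, Ph.D. candidate; main research interests: distributed computing and grid computing. GUO Binglei (born 1991), female, from Wuhan, Hubei, Ph.D. candidate; main research interests: green computing and database systems. YU Jiong (born 1964), male, from Beijing, Ph.D., professor; main research interests: network security, grid computing, and distributed computing. ZHANG Xuguang (born 1994), male, from Zhengzhou, Henan, M.S. candidate; main research interest: big data computing. LIU Yan (born 1990), male, from Wuhan, Hubei, M.S. candidate; main research interest: big data computing.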

Abstract

In MapReduce computing scenarios, complex data mining algorithms typically require multiple cooperating MapReduce jobs to complete a task. However, heavy redundant disk reads and writes and repeated resource requests among these jobs seriously degrade the performance of such algorithms. To improve the computational efficiency of the ItemBased recommendation algorithm, the performance issues of the ItemBased collaborative filtering algorithm on the MapReduce platform were first analyzed. The execution efficiency of the algorithm was then improved by exploiting Spark's advantages in iterative and in-memory computation, and the ItemBased collaborative filtering algorithm was implemented on the Spark platform. The experimental results show that, with clusters of 10 and 20 nodes, the running time of the algorithm on Spark is only 25.6% and 30.8%, respectively, of that on MapReduce. Overall, the algorithm runs more than 3 times faster on Spark than on MapReduce.
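For readers unfamiliar with item-based collaborative filtering, the following is a minimal single-machine sketch of the general technique, not the paper's Spark implementation. The abstract does not state which similarity measure or prediction formula the authors use, so cosine similarity over co-rated items and a similarity-weighted average are assumptions here, as are the function names `item_similarities` and `recommend` and the toy ratings.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Cosine similarity between items, given {user: {item: rating}}.

    Assumption: cosine similarity; the abstract does not name a measure.
    """
    dot = defaultdict(float)   # (item_a, item_b) -> sum of rating products
    norm = defaultdict(float)  # item -> sum of squared ratings
    for user_ratings in ratings.values():
        for item, r in user_ratings.items():
            norm[item] += r * r
        # Only items rated by the same user contribute to the dot product.
        for a, b in combinations(sorted(user_ratings), 2):
            dot[(a, b)] += user_ratings[a] * user_ratings[b]
    sims = defaultdict(dict)
    for (a, b), d in dot.items():
        s = d / (sqrt(norm[a]) * sqrt(norm[b]))
        sims[a][b] = s
        sims[b][a] = s
    return sims


def recommend(ratings, sims, user, top_n=3):
    """Rank unseen items by the similarity-weighted average of the user's ratings."""
    seen = ratings[user]
    num = defaultdict(float)
    den = defaultdict(float)
    for item, r in seen.items():
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                num[other] += s * r
                den[other] += s
    scores = {i: num[i] / den[i] for i in num if den[i] > 0}
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical toy data, for illustration only.
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 1.0},
}
sims = item_similarities(ratings)
print(recommend(ratings, sims, "u1"))
```

In a distributed setting, the accumulation of pairwise products, the normalization, and the prediction step would each typically become a separate stage. The abstract attributes the MapReduce slowdown to redundant disk I/O and repeated resource requests between such jobs, which Spark's in-memory computation is used to avoid.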

Key words: collaborative filtering, MapReduce, Spark algorithm, performance optimization, Directed Acyclic Graph (DAG)

CLC Number: TP393.09

LIAO Bin, ZHANG Tao, GUO Binglei, YU Jiong, ZHANG Xuguang, LIU Yan. Performance optimization of ItemBased recommendation algorithm based on Spark[J]. Journal of Computer Applications, 2017, 37(7): 1900-1905.
