Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (1): 48-52.DOI: 10.11772/j.issn.1001-9081.2015.01.0048


PageRank parallel algorithm based on Web link classification

CHEN Cheng1, ZHAN Yinwei1, LI Ying2   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou Guangdong 511400, China;
  2. Institute of Digital Guangdong, Guangzhou Guangdong 510000, China
  • Received:2014-07-17 Revised:2014-08-21 Online:2015-01-01 Published:2015-01-26

基于网页链接分类的PageRank并行算法

陈诚1, 战荫伟1, 李鹰2   

  1. 广东工业大学 计算机学院, 广州511400;
  2. 广东省数字广东研究院, 广州510000
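  • Corresponding author: CHEN Cheng
  • About the authors: CHEN Cheng (born 1989), male, from Jianyang, Sichuan, M.S. candidate; research interests: cloud computing, distributed applications, data mining. ZHAN Yinwei (born 1966), male, from Changchun, Jilin, professor-level senior engineer, Ph.D.; research interests: image processing, pattern recognition, data mining. LI Ying (born 1958), male, from Harbin, Heilongjiang, professor-level senior engineer, Ph.D.; research interests: remote sensing, Internet of Things, big data analysis.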

Abstract:

To address the low efficiency of the serial PageRank algorithm on massive Web data, a parallel PageRank algorithm based on Web link classification is proposed. First, Web pages are classified according to the website they belong to, and pages from different sites are assigned different weights. Second, Web page ranks are computed in parallel on the Hadoop platform, exploiting the divide-and-conquer nature of MapReduce. Finally, a three-layer data compression method, covering the data layer, the preprocessing layer, and the computation layer, is used to optimize the parallel algorithm. Experimental results show that, compared with the serial PageRank algorithm, the proposed algorithm improves result accuracy by 12% and computational efficiency by 33% in the best case.

Key words: link classification, Hadoop, PageRank, MapReduce, data compression
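
The first two steps of the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not give the weighting scheme, so the rule used here (links within one site count less than links across sites), the constants `INTRA_SITE_WEIGHT`, `INTER_SITE_WEIGHT`, and `DAMPING`, and all function names are assumptions chosen for illustration. The map and reduce steps are plain Python functions standing in for Hadoop jobs, and the three-layer compression is not modeled.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed values for illustration only; the abstract does not state them.
INTRA_SITE_WEIGHT = 0.5  # link between two pages of the same site
INTER_SITE_WEIGHT = 1.0  # link between pages of different sites
DAMPING = 0.85           # conventional PageRank damping factor


def link_weight(src, dst):
    """Classify a link by comparing the host parts of the two URLs."""
    same_site = urlparse(src).netloc == urlparse(dst).netloc
    return INTRA_SITE_WEIGHT if same_site else INTER_SITE_WEIGHT


def map_page(page, rank, out_links):
    """Map step: emit (target, weighted share of this page's rank)."""
    weights = {dst: link_weight(page, dst) for dst in out_links}
    total = sum(weights.values())
    for dst, weight in weights.items():
        yield dst, rank * weight / total


def reduce_page(contributions, n_pages):
    """Reduce step: combine all shares received by one page."""
    return (1 - DAMPING) / n_pages + DAMPING * sum(contributions)


def pagerank(graph, iterations=20):
    """Run map and reduce repeatedly over a {page: [out_links]} graph.

    Pages without out-links are not handled specially in this sketch,
    so their rank mass is lost on each iteration.
    """
    n_pages = len(graph)
    ranks = {page: 1.0 / n_pages for page in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for page, out_links in graph.items():
            for dst, share in map_page(page, ranks[page], out_links):
                grouped[dst].append(share)
        ranks = {page: reduce_page(grouped[page], n_pages) for page in graph}
    return ranks


if __name__ == "__main__":
    example_graph = {
        "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
        "http://a.example/2": ["http://a.example/1"],
        "http://b.example/1": ["http://a.example/1"],
    }
    for page, rank in sorted(pagerank(example_graph).items()):
        print(page, round(rank, 4))
```

Because each page's shares are normalized by the total weight of its out-links, the ranks still sum to 1 when every page has at least one out-link inside the graph; the weights only change how a page's rank is split among its targets.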

摘要:

针对串行PageRank算法在处理海量网页数据时效率低下的问题,提出一种基于网页链接分类的PageRank并行算法.首先,将网页按照网页所属网站分类,为来自不同站点的网页设置不同的权重;其次,利用Hadoop并行计算框架,结合MapReduce分而治之的特点,并行计算网页排名;最后,采用一种包含3层:数据层、预处理层、计算层的数据压缩方法,对并行算法进行优化.实验结果表明,与串行PageRank算法相比,所提算法在最好情况下结果准确率提高了12%,计算效率提高了33%.

关键词: 链接分类, Hadoop, PageRank, MapReduce, 数据压缩

CLC Number: