Journal of Computer Applications, 2016, Vol. 36, Issue 3: 735-739. DOI: 10.11772/j.issn.1001-9081.2016.03.735

Combining topic similarity with link weight for Web spam ranking detection

WEI Sha, ZHU Yan   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 610031, China
  • Received: 2015-07-29; Revised: 2015-10-03; Online: 2016-03-10; Published: 2016-03-17
  • Supported by:
    This work is supported by the Academic and Technological Leadership Foundation of Sichuan Province, China.

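  • Corresponding author: ZHU Yan
  • About the authors: WEI Sha (born 1989), female, from Baise, Guangxi, M.S. candidate; main research interest: Web data mining. ZHU Yan (born 1965), female, from Guilin, Guangxi, professor, Ph.D., CCF member; main research interests: data mining, Web anomaly detection, big data management and intelligent analysis.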

Abstract: Good-to-bad links, that is, links from normal pages to spam pages, degrade the detection performance of distrust ranking algorithms such as Anti-TrustRank. To address this problem, a distrust ranking algorithm named Topic Link Distrust Rank (TLDR) was proposed, which combines topic similarity with link weight to regulate the propagation of distrust scores. Firstly, the topic distributions of all pages were obtained by Latent Dirichlet Allocation (LDA), and the topic similarity between linked pages was computed. Secondly, link weights were computed from the Web graph and combined with topic similarity to obtain the topic-link weight matrix. Then, the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms were improved by using the topic-link weights to regulate distrust propagation, so that pages receive more reasonable distrust scores. Finally, all pages were ranked by distrust score, and spam pages were detected by applying a threshold. Experimental results on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, TLDR raises SpamFactor by 45% and 23.7%, improves F1-measure (at a threshold of 600) by 3.4 and 0.5 percentage points, and increases the spam ratio in the top three buckets by 15 and 10 percentage points, respectively. TLDR, which combines topic similarity with link weight, can therefore effectively improve Web spam detection performance.

Key words: Web spam detection, link-based spam, ranking algorithm, topic similarity, distrust propagation
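
The pipeline described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: topic distributions are supplied directly instead of being fitted with LDA, topic similarity is taken to be cosine similarity, the topic-link weight is taken to be the product of link weight and similarity, and distrust is propagated backward along links in the style of Anti-TrustRank. The function names, the toy graph, and the parameters `alpha`, `iters`, and `top_k` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two topic distributions (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_link_weights(edges, topics):
    """Combine each link weight with the topic similarity of its endpoints.

    edges:  {(src, dst): link_weight}
    topics: {page: topic distribution}
    The product used here is an assumption, not the paper's formula.
    """
    return {(s, d): w * cosine(topics[s], topics[d]) for (s, d), w in edges.items()}


def distrust_rank(pages, weights, spam_seeds, alpha=0.85, iters=50):
    """Propagate distrust from spam seeds backward along weighted links."""
    # Total weight entering each target page, used to split its distrust
    # among the pages that link to it.
    in_total = {}
    for (_, d), w in weights.items():
        in_total[d] = in_total.get(d, 0.0) + w

    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    score = dict(seed)
    for _ in range(iters):
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for (s, d), w in weights.items():
            if in_total[d] > 0:
                # The source page inherits a share of the target's distrust.
                nxt[s] += alpha * score[d] * w / in_total[d]
        score = nxt
    return score


# Toy graph: "s" is a known spam page; "a" and "b" both link to it.
# "a" is topically close to "s", while "b" is not (a good-to-bad link).
pages = ["a", "b", "s"]
topics = {"s": [0.9, 0.1], "a": [0.8, 0.2], "b": [0.1, 0.9]}
edges = {("a", "s"): 1.0, ("b", "s"): 1.0}

weights = topic_link_weights(edges, topics)
scores = distrust_rank(pages, weights, spam_seeds={"s"})

# Rank by distrust and flag the top_k pages (hypothetical threshold).
ranked = sorted(pages, key=lambda p: scores[p], reverse=True)
top_k = 2
flagged = ranked[:top_k]
```

In this toy graph the topically dissimilar page `b` receives less distrust than `a` even though both link to the same spam page, which is the effect the abstract attributes to weighting propagation by topic similarity.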

