Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (3): 735-739. DOI: 10.11772/j.issn.1001-9081.2016.03.735

Combining topic similarity with link weight for Web spam ranking detection

WEI Sha, ZHU Yan

School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 610031, China

Received: 2015-07-29 Revised: 2015-10-03 Online: 2016-03-10 Published: 2016-03-17
Corresponding author: ZHU Yan
About the authors: WEI Sha (1989-), female, born in Baise, Guangxi, M.S. candidate; main research interest: Web data mining. ZHU Yan (1965-), female, born in Guilin, Guangxi, professor, Ph.D., CCF member; main research interests: data mining, Web anomaly detection, big data management and intelligent analysis.
Supported by:
This work is supported by the Academic and Technological Leadership Foundation of Sichuan Province, China.

Abstract

Abstract: Links from good pages to spam pages (good-to-bad links) on the Web degrade the detection performance of ranking algorithms such as Anti-TrustRank. To address this problem, a distrust ranking algorithm named Topic Link Distrust Rank (TLDR) was proposed, which combines topic similarity with link weight to jointly adjust the propagation of distrust scores. Firstly, the topic distributions of all pages were obtained with the Latent Dirichlet Allocation (LDA) model, and the topic similarity between linked pages was computed. Secondly, link weights were computed from the Web graph and combined with topic similarity to obtain the topic-link weight matrix. Then, the topic-link weights were used to adjust distrust propagation, improving the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms so that pages receive more reasonable distrust scores. Finally, all pages were ranked by their distrust scores, and spam pages were detected by applying a threshold. Experimental results on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, TLDR raises SpamFactor by 45% and 23.7%, improves F1-measure (with the threshold set to 600) by 3.4 and 0.5 percentage points, and increases the spam ratio (in the top three buckets) by 15 and 10 percentage points, respectively. Therefore, the TLDR algorithm, which combines topic with link weight, can effectively improve Web spam detection performance.

Key words: Web spam detection, link-based spam, ranking algorithm, topic similarity, distrust propagation
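
The four steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the cosine similarity measure, the unit link weights, the per-target normalisation, the damping factor `alpha`, and all function and page names below are assumptions made for illustration. The LDA step is skipped, and the topic distributions are written by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_link_weights(links, topics):
    """Scale each link weight by the topic similarity of its two endpoints.

    links:  {(src, dst): link_weight}
    topics: {page: topic distribution}
    """
    return {(p, q): w * cosine(topics[p], topics[q]) for (p, q), w in links.items()}

def distrust_rank(weights, pages, spam_seeds, alpha=0.85, iters=50):
    """Propagate distrust backwards along links, Anti-TrustRank style.

    Distrust held by a target page q flows to every page p that links to q,
    in proportion to the topic-link weight of the link p -> q.
    """
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    # Total incoming weight per target page, used to normalise the shares.
    in_total = {q: 0.0 for q in pages}
    for (p, q), w in weights.items():
        in_total[q] += w
    score = dict(seed)
    for _ in range(iters):
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for (p, q), w in weights.items():
            if in_total[q] > 0:
                nxt[p] += alpha * score[q] * w / in_total[q]
        score = nxt
    return score

def detect(score, top_n):
    """Rank pages by distrust score and flag the top_n as spam."""
    return sorted(score, key=score.get, reverse=True)[:top_n]

# Toy data (hypothetical): "farm" and "good" both link to the spam seed,
# but only "farm" shares its topic.
pages = ["spam1", "farm", "good"]
topics = {"spam1": [0.8, 0.2], "farm": [0.9, 0.1], "good": [0.1, 0.9]}
links = {("farm", "spam1"): 1.0, ("good", "spam1"): 1.0}

weights = topic_link_weights(links, topics)
score = distrust_rank(weights, pages, spam_seeds={"spam1"})
flagged = detect(score, top_n=2)
```

In this toy graph the good-to-bad link from `good` carries a low topic similarity, so `good` receives less distrust than `farm` and stays below the cut-off. With uniform weights both pages would receive the same score.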

CLC number: TP181

WEI Sha, ZHU Yan. Combining topic similarity with link weight for Web spam ranking detection[J]. Journal of Computer Applications, 2016, 36(3): 735-739.

