Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (12): 3368-3371.

• Papers of the 2013 National Annual Conference on Open Distributed and Parallel Computing (DPCS2013) •

Parallel recognition of illegal Web pages based on improved KNN classification algorithm

XU Yabin1,2, LI Zhuo1,2, CHEN Junyi2

  1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China
  2. Computer School, Beijing Information Science and Technology University, Beijing 100101, China
  • Received: 2013-07-30 Online: 2013-12-31 Published: 2013-12-01
  • Contact: XU Yabin
  • About the authors: XU Yabin (born 1962), male, from Jinzhou, Liaoning; professor, CCF member; research interests: cloud computing, Internet of Things, next-generation Internet.
    LI Zhuo (born 1983), male, from Nanyang, Henan; lecturer, CCF member; research interests: wireless networks, mobile computing.
    CHEN Junyi (born 1984), female, from Weihai, Shandong; master's student; research interests: cloud computing, next-generation Internet.
  • Supported by:
    Projects supported by the National Natural Science Foundation of China

Abstract: Many illegal Web pages exist on the Internet, carrying pornographic, violent, gambling or reactionary content. If they are not filtered effectively, they degrade search services. An improved K-Nearest Neighbors (KNN) classification algorithm is proposed to raise recognition accuracy, and it is implemented on a virtualized platform using the MapReduce model provided by the open-source Hadoop software, so that recognition runs in a distributed and parallel fashion. Comparative experiments show that the proposed method considerably improves both recognition accuracy and recognition efficiency.

Key words: illegal Web page, text classification, K-Nearest Neighbors (KNN) classification algorithm, Hadoop, MapReduce
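
The abstract describes KNN classification split across a MapReduce job but does not say what the improvement to KNN is. The following is therefore only a minimal sketch of plain (unimproved) KNN arranged in map and reduce steps, written in pure Python rather than against Hadoop. Everything in it is an assumption made for illustration: the function names, the sparse term-weight dictionaries, the cosine distance, the toy training pages, and the labels "illegal" and "normal".

```python
from collections import Counter
from heapq import nsmallest
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def map_partition(partition, query, k):
    """Map step: find the k nearest labelled pages inside one training partition."""
    scored = ((cosine_distance(vec, query), label) for vec, label in partition)
    return nsmallest(k, scored)


def reduce_candidates(candidate_lists, k):
    """Reduce step: merge per-partition candidates, keep the global k nearest, vote."""
    merged = [c for candidates in candidate_lists for c in candidates]
    nearest = nsmallest(k, merged)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify(partitions, query, k=3):
    """Run the map step on every partition, then a single reduce step."""
    return reduce_candidates([map_partition(p, query, k) for p in partitions], k)


# Hypothetical toy training set, split into two partitions as a cluster would split it.
partitions = [
    [({"casino": 1.0, "bet": 1.0}, "illegal"),
     ({"news": 1.0, "weather": 1.0}, "normal")],
    [({"bet": 1.0, "poker": 1.0}, "illegal"),
     ({"recipe": 1.0, "news": 1.0}, "normal")],
]

if __name__ == "__main__":
    print(classify(partitions, {"bet": 1.0, "casino": 1.0, "poker": 1.0}))
    print(classify(partitions, {"news": 1.0, "weather": 1.0}))
```

Each map call only needs to return its local k best candidates, because the global k nearest neighbours must be among them; this keeps the data sent to the reduce step small regardless of partition size.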

CLC number: