Journal of Computer Applications, 2017, Vol. 37, Issue (4): 924-927. DOI: 10.11772/j.issn.1001-9081.2017.04.0924

• Big Data, Cloud Computing and Their Applications •

Cascaded and low-consuming online method for large-scale Web page category acquisition

WANG Yaqiang1, TANG Ming1, ZENG Qin2, TANG Dan1, SHU Hongping1

  1. College of Software Engineering, Chengdu University of Information Technology, Chengdu Sichuan 610225, China;
  2. Guangdong Meteorological Observatory, Guangzhou Guangdong 510080, China
  • Received: 2016-10-08 Revised: 2016-11-27 Online: 2017-04-10 Published: 2017-04-19
  • Corresponding author: WANG Yaqiang
  • About the authors: WANG Yaqiang (王亚强, born 1984), male, from Longjing, Jilin; lecturer, Ph.D., CCF member; research interests: big data, cloud computing, natural language processing, machine learning. ZENG Qin (曾沁, born 1975), male, from Meizhou, Guangdong; senior engineer, M.S.; research interests: big data, cloud computing, fine-grained forecasting, meteorological big data analysis. TANG Dan (唐聃, born 1982), male, from Chengdu, Sichuan; associate professor, Ph.D., CCF member; research interests: big data, cloud computing, coding theory. SHU Hongping (舒红平, born 1974), male, from Chongqing; professor, Ph.D.; research interests: big data, cloud computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61501063, 61501064), the Scientific Research Foundation of Science and Technology Department of Sichuan Province (2016JY0240), and the Scientific Research Foundation of Sichuan Education Department (15ZB0177).

Abstract: To better balance accuracy against resource cost when building an automatic system that collects massive numbers of well-classified Web pages online, a Web page classification method based on cascaded classifiers is proposed. The method uses a cascade strategy to combine online and offline Web page classification so as to take full advantage of both. The first-level classifier works online: it predicts a page's category using only the features of the page title contained in the anchor text, and computes a confidence value for the result from the information entropy of the posterior probability distribution over categories. The second-level classifier is triggered when this entropy-based confidence value exceeds a threshold computed in advance by Multi-Objective Particle Swarm Optimization (MOPSO). The second-level classifier extracts features from the body of the downloaded Web page and classifies the page offline with a classifier pre-trained on Web page body features. In comparison experiments, the proposed method increased the F1 measure by 10.85% over online classification alone and by 4.57% over offline classification alone. Its efficiency did not decrease much relative to online classification alone (by no more than about 30%), and improved by about 70% relative to offline classification alone. The results show that the cascaded system not only classifies more accurately but also significantly reduces computing overhead and bandwidth consumption.
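The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `entropy`, `cascade_classify`, `online_clf`, `offline_clf`, `fetch_body` and the toy classifiers are hypothetical, the threshold is a hand-picked constant rather than one obtained by MOPSO, and the sketch assumes that a higher posterior entropy means a less certain first-level result, so that exceeding the threshold triggers the second level.

```python
import math
from typing import Callable, Dict, Tuple

Posterior = Dict[str, float]  # category -> posterior probability


def entropy(posterior: Posterior) -> float:
    """Shannon entropy (in bits) of a posterior distribution over categories."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0.0)


def cascade_classify(
    anchor_text: str,
    url: str,
    online_clf: Callable[[str], Posterior],
    offline_clf: Callable[[str], Posterior],
    fetch_body: Callable[[str], str],
    threshold: float,
) -> Tuple[str, int]:
    """Return (predicted category, cascade level that produced it)."""
    # Level 1: classify from the anchor text only; no page download needed.
    posterior = online_clf(anchor_text)
    if entropy(posterior) <= threshold:
        return max(posterior, key=posterior.get), 1
    # Level 2: the first-level result is too uncertain (assumed reading),
    # so download the page body and classify it offline.
    posterior = offline_clf(fetch_body(url))
    return max(posterior, key=posterior.get), 2


# Hypothetical stand-ins for trained classifiers and a page downloader.
def toy_online(anchor_text: str) -> Posterior:
    if "football" in anchor_text.lower():
        return {"sports": 0.9, "finance": 0.1}
    return {"sports": 0.5, "finance": 0.5}


def toy_offline(body: str) -> Posterior:
    return {"sports": 0.2, "finance": 0.8}


def toy_fetch(url: str) -> str:
    return "stock market report"


THRESHOLD = 0.7  # hand-picked for the demo; the paper derives it with MOPSO

print(cascade_classify("Football final tonight", "http://example.com/a",
                       toy_online, toy_offline, toy_fetch, THRESHOLD))
print(cascade_classify("Latest news", "http://example.com/b",
                       toy_online, toy_offline, toy_fetch, THRESHOLD))
```

In this sketch the first call stays at level 1 because the anchor-text posterior is sharply peaked, while the second call has a uniform posterior and falls through to the offline classifier. Only pages that reach level 2 incur a download, which is where the bandwidth saving claimed in the abstract would come from.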

Key words: large-scale Web page acquisition, Web page classification, cascaded classifier, confidence function, Multi-Objective Particle Swarm Optimization (MOPSO)

CLC Number: