Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (8): 2082-2086. DOI: 10.11772/j.issn.1001-9081.2016.08.2082

• The 6th China Conference on Data Mining (CCDM 2016) •

An extraction algorithm for key information in news Web pages

向菁菁1,2, 耿光刚1, 李晓东1   

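  1. China Internet Network Information Center, Beijing 100190;
    2. Computer Network Information Center, University of Chinese Academy of Sciences, Beijing 100190
  • Received: 2016-03-01 Revised: 2016-05-09 Online: 2016-08-10 Published: 2016-08-10
  • Corresponding author: XIANG Jingjing
  • About the authors: XIANG Jingjing (born 1993), female, from Jiujiang, Jiangxi, master's candidate, CCF member; main research interests: data analysis and data mining. GENG Guanggang (born 1980), male, from Tai'an, Shandong, associate research fellow, Ph.D., CCF member; main research interests: Internet data analysis and massive data processing. LI Xiaodong (born 1976), male, from Heze, Shandong, research fellow, Ph.D., CCF member; main research interests: basic network software, network information security and Internet data analysis.
  • Supported by:
    General Program of the National Natural Science Foundation of China (61375039); 135 Key Project of the Computer Network Information Center, Chinese Academy of Sciences (CNIC_PY_1402).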

Key information extraction algorithm of news Web pages

XIANG Jingjing1,2, GENG Guanggang1, LI Xiaodong1   

  1. China Internet Network Information Center, Beijing 100190, China;
    2. Computer Network Information Center, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2016-03-01 Revised: 2016-05-09 Online: 2016-08-10 Published: 2016-08-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61375039) and the 135 Program of the Computer Network Information Center, Chinese Academy of Sciences (CNIC_PY_1402).

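Abstract: To address the lack of generality in Web page body-text extraction algorithms, and the absence of title, time and source information when extracting from news Web pages, a news key-information extraction algorithm named newsExtractor is proposed. The algorithm first preprocesses the Web page into a set of line numbers and texts. Then, exploiting the observation that the sentence with the most characters appears in the news body with very high probability, it searches from the middle of the body toward both ends for the start and end of the body and extracts the news body. The title is extracted with the longest common substring algorithm; the time is extracted with constructed regular expressions, using line numbers as an auxiliary cue; and the source is extracted from its formatting characteristics, again aided by line numbers. Finally, a data set was constructed and a comparison experiment on extraction accuracy was conducted against the foreign open-source software newsPaper. The experimental results show that newsExtractor outperforms newsPaper in average extraction accuracy for body, title, time and source, and that it is general and robust.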
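The body-extraction step can be illustrated with a minimal sketch. It assumes the page has already been reduced to a list of (line number, text) pairs, and it uses the longest line as a stand-in for "the line containing the longest sentence". The function name `extract_body` and the parameters `min_len` and `max_gap` are hypothetical: the abstract does not state the stopping rule newsExtractor actually uses, so the rule below (stop after more than `max_gap` consecutive short lines) is only one plausible choice.

```python
from typing import List, Tuple


def extract_body(
    lines: List[Tuple[int, str]],
    min_len: int = 20,   # assumed threshold for a "body-like" line
    max_gap: int = 1,    # assumed tolerance for short lines inside the body
) -> Tuple[int, int]:
    """Return (start_line_no, end_line_no) of the presumed news body."""
    if not lines:
        raise ValueError("no text lines to search")

    # Seed: index of the longest text, which is assumed to lie inside the body.
    seed = max(range(len(lines)), key=lambda i: len(lines[i][1]))

    def grow(step: int) -> int:
        """Walk away from the seed; return the index of the last body-like line."""
        last_good = seed
        gap = 0
        i = seed + step
        while 0 <= i < len(lines):
            if len(lines[i][1]) >= min_len:
                last_good = i
                gap = 0
            else:
                gap += 1
                if gap > max_gap:
                    break
            i += step
        return last_good

    start = grow(-1)  # search toward the top of the page
    end = grow(+1)    # search toward the bottom of the page
    return lines[start][0], lines[end][0]
```

Starting from the middle of the body rather than from the top of the page means navigation bars and footers are never considered unless the walk actually reaches them, which is presumably what makes the approach independent of any particular page template.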

Keywords: Web page information extraction, news information extraction, Web page denoising

Abstract: Existing content extraction algorithms for Web pages lack generality, and extraction from news Web pages typically omits the title, release time and source. To resolve these problems, a new key information extraction algorithm, newsExtractor, was proposed. Firstly, the HTML code of a Web page was parsed into a set of line numbers and their text. Then, because the longest sentence belongs to the news content with extremely high probability, the extractor searched outward from the line containing the longest sentence to find the boundaries of the news content. Meanwhile, the longest common substring algorithm was used to extract the title, regular expressions together with line numbers were used to extract the release time, and the presentation characteristics of the source together with line numbers were used to extract the source. Finally, a data set was built to conduct a comparison experiment on extraction accuracy with an open-source software named newsPaper. Experimental results show that newsExtractor outperforms newsPaper in the average accuracy of content, title, release-time and source extraction, and that it has strong generality and robustness.
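The title and release-time steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the title is found by comparing the text of the page's `<title>` tag with the lines above the body, and that the release time is the date-like match closest to the body start. The names `extract_title`, `extract_time`, `DATE_RE` and the parameter `body_start` are hypothetical, and the regular expression covers only a few common date layouts.

```python
import re
from typing import List, Optional, Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Classic dynamic programming over suffix-match lengths, O(len(a) * len(b))."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def extract_title(title_tag: str, lines: List[Tuple[int, str]], body_start: int) -> str:
    """Assumed rule: the title is the longest substring shared by the <title>
    text and any line located above the body (lines sorted by line number)."""
    best = ""
    for line_no, text in lines:
        if line_no >= body_start:
            break
        common = longest_common_substring(title_tag, text)
        if len(common) > len(best):
            best = common
    return best.strip()


# Assumed pattern: YYYY-MM-DD, YYYY/MM/DD or YYYY年MM月DD日, optionally followed by HH:MM.
DATE_RE = re.compile(
    r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?(?:\s+(\d{1,2}):(\d{2}))?"
)


def extract_time(lines: List[Tuple[int, str]], body_start: int) -> Optional[str]:
    """Assumed rule: among all date-like matches, keep the one whose line
    number is closest to the start of the body."""
    best: Optional[Tuple[int, str]] = None
    for line_no, text in lines:
        match = DATE_RE.search(text)
        if match and (best is None or abs(line_no - body_start) < abs(best[0] - body_start)):
            best = (line_no, match.group(0))
    return best[1] if best else None
```

Using the line number as a tie-breaker is what lets a sketch like this ignore dates in footers or sidebars, which tend to sit far from the body start.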

Key words: Web information extraction, news information extraction, Web denoising

CLC Number: