Key information extraction algorithm of news Web pages

Journal of Computer Applications, 2016, Vol. 36, Issue (8): 2082-2086. DOI: 10.11772/j.issn.1001-9081.2016.08.2082

The 6th China Conference on Data Mining (CCDM 2016)

XIANG Jingjing^1,2, GENG Guanggang^1, LI Xiaodong^1

1. China Internet Network Information Center, Beijing 100190, China;
2. Computer Network Information Center, University of Chinese Academy of Sciences, Beijing 100190, China

Received: 2016-03-01 Revised: 2016-05-09 Online: 2016-08-10 Published: 2016-08-10
Corresponding author: XIANG Jingjing
About the authors: XIANG Jingjing (born 1993), female, from Jiujiang, Jiangxi, M.S. candidate, CCF member; research interests: data analysis, data mining. GENG Guanggang (born 1980), male, from Tai'an, Shandong, associate research fellow, Ph.D., CCF member; research interests: Internet data analysis, massive data processing. LI Xiaodong (born 1976), male, from Heze, Shandong, research fellow, Ph.D., CCF member; research interests: fundamental network software, network information security, Internet data analysis.
Supported by:
This work is partially supported by the General Program of the National Natural Science Foundation of China (61375039) and the 135 Key Program of the Computer Network Information Center, Chinese Academy of Sciences (CNIC_PY_1402).

Abstract

Existing Web page content extraction algorithms lack generality, and when applied to news pages they do not extract the title, release time, or source. To address both problems, a key information extraction algorithm for news pages, newsExtractor, is proposed. The algorithm first preprocesses a Web page into a set of line numbers and texts. Because the longest sentence on a page belongs to the news body with very high probability, the algorithm then starts from the line containing that sentence and searches toward both ends for the start and end of the body. The title is extracted with the longest common substring algorithm; the release time is extracted with regular expressions, with line numbers assisting the decision; and the source is extracted from its formatting characteristics, again assisted by line numbers. Finally, a data set was built and newsExtractor was compared with the open-source software newsPaper in extraction accuracy. Experimental results show that newsExtractor achieves higher average extraction accuracy than newsPaper for body, title, release time, and source, and that it is general and robust.

Key words: Web information extraction, news information extraction, Web denoising
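
The steps named in the abstract can be illustrated with a short Python sketch. It is not the authors' newsExtractor implementation, which the abstract does not specify in detail. The function names, the `min_len` threshold, the rule for when a neighbouring line joins the body, the comparison of candidate lines against the page's `<title>` text, and the "nearest date above the body" rule are all assumptions made for illustration. The input is assumed to be text that has already had its HTML tags removed.

```python
import re

SENTENCE_END = re.compile(r"[。！？!?.]")
DATE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")


def to_lines(text):
    """Turn tag-stripped page text into (line_number, text) pairs, skipping blank lines."""
    return [(no, line.strip()) for no, line in enumerate(text.splitlines()) if line.strip()]


def find_body(lines, min_len=30):
    """Grow a body span outward from the line holding the longest sentence.

    Returns (start, end) positions into `lines`, or None for an empty page.
    min_len is a hypothetical threshold: a neighbouring line joins the body
    only if it has at least min_len characters.
    """
    if not lines:
        return None

    def longest_sentence(text):
        return max(len(s) for s in SENTENCE_END.split(text))

    seed = max(range(len(lines)), key=lambda i: longest_sentence(lines[i][1]))
    start = end = seed
    while start > 0 and len(lines[start - 1][1]) >= min_len:
        start -= 1
    while end < len(lines) - 1 and len(lines[end + 1][1]) >= min_len:
        end += 1
    return start, end


def longest_common_substring(a, b):
    """Dynamic-programming longest common substring, O(len(a) * len(b))."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def find_title(lines, body_start, page_title):
    """Return the longest substring shared by the <title> text and any line above the body."""
    matches = [longest_common_substring(text, page_title) for _, text in lines[:body_start]]
    return max(matches, key=len, default="")


def find_time(lines, body_start):
    """Return the date match nearest above the body, or None (proximity rule is assumed)."""
    for _, text in reversed(lines[:body_start]):
        match = DATE.search(text)
        if match:
            return match.group(0)
    return None
```

A caller would run `to_lines` once, pass the result to `find_body`, and then use the returned start position to restrict the title and time search to the lines above the body. Source extraction is omitted because the abstract does not describe the format features it relies on.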

CLC number: TP391

XIANG Jingjing, GENG Guanggang, LI Xiaodong. Key information extraction algorithm of news Web pages[J]. Journal of Computer Applications, 2016, 36(8): 2082-2086.

