Visual representation model and automatic keywords extraction algorithm for hub Web pages

doi:10.3724/SP.J.1087.2012.02360

Journal of Computer Applications, 2012, Vol. 32, Issue (08): 2360-2368. DOI: 10.3724/SP.J.1087.2012.02360

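PENG Hao¹, CAI Meiling¹,², CHEN Jifeng¹, LIU Chi³, YU Bingrui¹

1. College of Computer Science and Technology, Hunan International Economics University, Changsha, Hunan 410205, China
2. School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China
3. Electricity Utilization Technology Publishing Center, China Electric Power Press, Beijing 100005, China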

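Received: 2011-11-11 Revised: 2012-01-31 Online: 2012-08-28 Published: 2012-08-01
Corresponding author: PENG Hao
About the authors: PENG Hao (born 1978), male, from Changsha, Hunan; lecturer, M.S.; main research interests: Web information acquisition and processing, real-time scheduling;
CAI Meiling (born 1982), female, from Changsha, Hunan; lecturer, Ph.D. candidate; main research interests: network computing, intelligent information processing, graphics and image information processing;
CHEN Jifeng (born 1966), male, from Liuyang, Hunan; professor, Ph.D.; main research interest: software test automation;
LIU Chi (born 1979), male, from Yuanling, Hunan; engineer, master's degree candidate; main research interests: software engineering, embedded systems;
YU Bingrui (born 1988), male, from Huaihua, Hunan; main research interest: Web information acquisition and processing.
Supported by:
National Natural Science Foundation of China (60803024); Natural Science Foundation of Hunan Province (10JJ6092); Hunan Provincial Undergraduate Research Learning and Innovative Experiment Program (Xiang Jiao Tong [2011] No. 272, project No. 393)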


Abstract: Hub Web pages usually contain a large amount of noise, which makes automatic keyword extraction difficult. To address this problem, a new Web page representation model, PIX-PAGE, and an automatic keyword extraction algorithm for hub Web pages, P-KEA, are proposed. Using the proposed region merging algorithm, the PIX-PAGE model segments a Web page into regions of appropriate granularity. It then quantifies the visual "singularity" of each region according to the characteristics of human vision, and applies a singularity propagation rule to further strengthen the visual singularity of the regions related to keywords. Based on the visual quantification results of the PIX-PAGE model, P-KEA can locate the keywords in visually prominent regions with good accuracy. Experimental results show that, compared with DVM, an algorithm based on the DocView model, P-KEA improves accuracy by 20.9% on average.

Key words: region merging, visual quantification, Web page representation model, automatic keyword extraction
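
The pipeline summarized in the abstract (score regions visually, propagate the scores, then pick keywords from the most prominent region) can be illustrated with a minimal Python sketch. This is not the authors' PIX-PAGE or P-KEA implementation: the abstract gives no formulas, so the `Region` fields, the use of font size as the only visual feature, the deviation-based score, the propagation `rate`, and the word-frequency keyword selection are all hypothetical assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Region:
    """A page region after merging. All fields are illustrative assumptions."""
    text: str
    font_size: float                               # assumed visual feature
    neighbors: list = field(default_factory=list)  # indices of adjacent regions
    score: float = 0.0                             # visual "singularity"


def quantify_singularity(regions):
    """Score each region by how far its font size deviates from the page mean."""
    sizes = [r.font_size for r in regions]
    mu, sigma = mean(sizes), pstdev(sizes)
    for r in regions:
        r.score = abs(r.font_size - mu) / sigma if sigma else 0.0


def propagate(regions, rate=0.5):
    """Pass a fraction of each region's score to its neighbors (assumed rule)."""
    base = [r.score for r in regions]  # snapshot so updates do not cascade
    for i, r in enumerate(regions):
        for j in r.neighbors:
            regions[j].score += rate * base[i]


def extract_keywords(regions, top_k=3):
    """Return the most frequent words in the highest-scoring region."""
    best = max(regions, key=lambda r: r.score)
    words = [w.lower() for w in best.text.split() if len(w) > 2]
    return [w for w, _ in Counter(words).most_common(top_k)]


if __name__ == "__main__":
    page = [
        Region("Sports news sports scores sports", 24.0, [1]),
        Region("football league football results", 12.0, [0]),
        Region("copyright contact about", 12.0),
    ]
    quantify_singularity(page)
    propagate(page)
    print(extract_keywords(page, top_k=2))
```

In this sketch a region adjacent to a visually unusual region inherits part of its score, which mirrors, under the stated assumptions, the idea of strengthening keyword-related regions before extraction.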

CLC number: TP391.4

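Cite this article: PENG Hao, CAI Meiling, CHEN Jifeng, LIU Chi, YU Bingrui. Visual representation model and automatic keywords extraction algorithm for hub Web pages [J]. Journal of Computer Applications, 2012, 32(08): 2360-2368.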

