Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (09): 2417-2420.DOI: 10.3724/SP.J.1087.2011.02417
• Database technology •
FAN Chun-long,XIA Jia,XIAO Xin,LV Hong-wei,XU Lei
范纯龙,夏佳,肖昕,吕红伟,徐蕾
Abstract: Blogs are an important class of online information resource, and extracting their comments underpins research such as public opinion analysis. This paper summarizes the current mainstream blog comment extraction algorithms and describes how page structure is applied in information extraction. Observing that people rely heavily on indicator phrases such as "Home" when reading Web pages, it proposes a technique that extracts comment information using functional semantic units, which carry clear semantics and a functional indication. The techniques involved in the extraction process are detailed, including page structure linearization, functional semantic unit recognition, body text identification, and the comment extraction algorithm. Experimental results show that the technique achieves good results in extracting both the blog body text and the comments.
Key words: functional semantic unit, information extraction, comment, blog, body text identification
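The stages named in the abstract (linearize the page, recognize functional semantic units, then extract comments) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which is not given on this page: the indicator-phrase sets `COMMENT_START` and `COMMENT_END`, the helper names `linearize` and `extract_comments`, and the sample page are all hypothetical, and body text identification is omitted. The sketch only assumes that a comment region starts at one indicator phrase and ends at the next one.

```python
from html.parser import HTMLParser

# Hypothetical indicator phrases (assumption): the paper's actual lexicon
# of functional semantic units is not listed on this page.
COMMENT_START = {"comments", "评论"}
COMMENT_END = {"post a comment", "leave a reply", "发表评论"}
SKIPPED_TAGS = {"script", "style"}


class Linearizer(HTMLParser):
    """Flatten a page into an ordered list of non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep visible text only; script and style content is ignored.
        if text and self._skip_depth == 0:
            self.nodes.append(text)


def linearize(html):
    """Return the page's visible text nodes in document order."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_comments(html):
    """Collect text nodes between a start indicator and the next end indicator."""
    comments = []
    inside = False
    for node in linearize(html):
        key = node.lower().rstrip(":：")
        if not inside and key in COMMENT_START:
            inside = True
        elif inside and key in COMMENT_END:
            break
        elif inside:
            comments.append(node)
    return comments


# Hypothetical sample page, for illustration only.
SAMPLE_PAGE = """<html><body>
<a href="/">Home</a>
<h1>My post</h1><p>Body text of the post.</p>
<h3>Comments:</h3>
<div>Nice article!</div><div>I disagree.</div>
<h3>Post a comment</h3><form><input name="c"></form>
</body></html>"""

if __name__ == "__main__":
    print(extract_comments(SAMPLE_PAGE))
```

On the sample page this collects the two comment texts and stops at "Post a comment". Exact phrase matching is the simplest possible recognizer; a real system would likely need fuzzier matching and per-site variation.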
CLC Number:
TP311.133.1
TP393.094
FAN Chun-long, XIA Jia, XIAO Xin, LV Hong-wei, XU Lei. Extraction technology of blog comments based on functional semantic units[J]. Journal of Computer Applications, 2011, 31(09): 2417-2420.
范纯龙, 夏佳, 肖昕, 吕红伟, 徐蕾. 基于功能语义单元的博客评论抽取技术[J]. 计算机应用, 2011, 31(09): 2417-2420.
URL: https://www.joca.cn/EN/10.3724/SP.J.1087.2011.02417
https://www.joca.cn/EN/Y2011/V31/I09/2417