Journal of Computer Applications, 2011, Vol. 31, Issue (09): 2417-2420. DOI: 10.3724/SP.J.1087.2011.02417

• Database Technology •

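Extraction technology of blog comments based on functional semantic units

FAN Chun-long, XIA Jia, XIAO Xin, LV Hong-wei, XU Lei

  1. College of Computer Science, Shenyang Aerospace University, Shenyang, Liaoning 110136, China
  • Received: 2011-03-29 Revised: 2011-05-31 Online: 2011-09-01 Published: 2011-09-01
  • Corresponding author: FAN Chun-long
  • About the authors: FAN Chun-long (1973-), male, from Yingkou, Liaoning, associate professor, M.S., main research interests: information security, intrusion detection;
    XIA Jia (1986-), male, from Yiyang, Hunan, M.S. candidate, main research interests: information security, access control;
    XIAO Xin (1988-), male, from Loudi, Hunan, M.S. candidate, main research interests: information security, access control;
    LV Hong-wei (1988-), male, from Baishan, Jilin, M.S. candidate, main research interests: information security, access control;
    XU Lei (1959-), female, from Shenyang, Liaoning, professor, main research interests: information security, access control.
  • Funding:
    Supported by a fund project of the Education Department of Liaoning Province (2009B140)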

Abstract: Blogs are an important class of online information resource, and extracting their comments is the foundation for research such as public opinion analysis. This paper summarizes the current mainstream blog comment extraction algorithms and describes how page structure is used in information extraction. Observing that people rely heavily on indicative phrases such as "Home" when reading a Web page, it proposes a technique that extracts comment information using functional semantic units, that is, units with clear semantics and a functional indicating role. The steps involved in the extraction process are described in detail: page structure linearization, functional semantic unit recognition, body text recognition, and the comment extraction algorithm. Finally, experiments show that the technique achieves good results in extracting both the body text and the comments of blogs.

Key words: functional semantic unit, information extraction, comment, blog, body text recognition
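
The abstract names the pipeline stages but gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' method. It flattens a page into its text nodes (a simple stand-in for page structure linearization), labels nodes that match a small indicator-phrase lexicon (a stand-in for functional semantic unit recognition), and collects the text between a comment-start and a comment-end indicator. The lexicon `INDICATORS`, the role names, the sample `PAGE`, and all function names are hypothetical assumptions introduced for illustration; body text recognition is not covered.

```python
from html.parser import HTMLParser

# Hypothetical lexicon: indicator phrase -> functional role (not from the paper).
INDICATORS = {
    "home": "nav",
    "comments": "comment_start",
    "post a comment": "comment_end",
}

SKIPPED_TAGS = {"script", "style"}


class Linearizer(HTMLParser):
    """Flatten an HTML page into its visible text nodes, in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.nodes.append(text)


def linearize(html):
    """Return the page's text nodes as a flat list."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.nodes


def label_units(nodes):
    """Pair each text node with its functional role, or None for plain content."""
    return [(text, INDICATORS.get(text.lower())) for text in nodes]


def extract_comments(html):
    """Collect plain text nodes between the comment_start and comment_end units."""
    comments, inside = [], False
    for text, role in label_units(linearize(html)):
        if role == "comment_start":
            inside = True
        elif role == "comment_end":
            inside = False
        elif inside and role is None:
            comments.append(text)
    return comments


# Hypothetical sample page, used only to exercise the sketch.
PAGE = """
<html><body>
<div><a href="/">Home</a></div>
<h1>My trip</h1>
<p>The body of the post.</p>
<h3>Comments</h3>
<ul><li>Nice post!</li><li>Thanks for sharing.</li></ul>
<h3>Post a comment</h3>
<form><input name="text"></form>
</body></html>
"""

if __name__ == "__main__":
    print(extract_comments(PAGE))
```

On the sample page, the two list items between the "Comments" heading and the "Post a comment" heading are returned as comments. A real system would presumably need a much richer lexicon and would use structural cues as well as exact phrase matches; the exact-match lookup here is chosen only to keep the sketch short.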
