Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (1): 163-170. DOI: 10.11772/j.issn.1001-9081.2016.01.0163

• Artificial Intelligence •

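Web information extraction in health field

LI Rujun, ZHANG Jun, ZHANG Xiaomin, GUI Xiaoqing

  1. College of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Received: 2015-07-01 Revised: 2015-08-12 Online: 2016-01-10 Published: 2016-01-09
  • Corresponding author: ZHANG Jun (born 1971), male, from Chongyang, Hubei; professor, Ph.D., CCF senior member. Research interests: databases and information retrieval, intelligent information processing.
  • About the authors: LI Rujun (born 1990), male, from Hengyang, Hunan; master's student; research interests: databases and information retrieval, intelligent information processing. ZHANG Xiaomin (born 1991), male, from Weifang, Shandong; master's student; research interests: databases and information retrieval, intelligent information processing. GUI Xiaoqing (born 1991), female, from Anqing, Anhui; master's student; research interests: databases and information retrieval, intelligent information processing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61073057).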

Abstract: To apply Web Information Extraction (WIE) technology to the health field, a Web information extraction method based on WebHarvest is proposed. Extraction rules for health entities are designed by analyzing the structure of different health Web sites, and a WebHarvest-based algorithm automatically extracts health entities and their attributes. The extracted entities and attributes are checked for consistency and stored in a relational database. Attribute values in the database that implicitly contain health entities are then processed with the Ansj natural language processing tool to recognize those entities, from which the relationships among health entities are extracted. In the health entity extraction experiments the average F-measure reached 99.9%, and in the entity relationship extraction experiments it reached 80.51%. The experimental results show that the proposed method extracts health information of high quality and credibility.

Key words: information extraction, health information extraction, consistency check, entity recognition, entity relationship extraction
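
The relationship-extraction and evaluation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain substring matching stands in for the Ansj-based entity recognition, the `Record` schema and the function names `extract_relations` and `f_measure` are hypothetical, and the F-measure is assumed to be the standard harmonic mean of precision and recall.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One extracted health entity with free-text attributes (hypothetical schema)."""
    name: str
    kind: str                                        # e.g. "disease", "drug", "symptom"
    attributes: dict = field(default_factory=dict)   # attribute name -> attribute text


def extract_relations(records):
    """Find known entity names mentioned inside other entities' attribute text.

    Substring matching is an assumption used here in place of Ansj;
    real Chinese text would need proper word segmentation.
    """
    known_names = {r.name for r in records}
    relations = set()
    for rec in records:
        for attr, text in rec.attributes.items():
            for name in known_names:
                if name != rec.name and name in text:
                    relations.add((rec.name, attr, name))
    return relations


def f_measure(predicted, gold):
    """Return (precision, recall, F) of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Each relation is a triple of source entity, attribute name, and mentioned entity, so the attribute name serves as a rough label for the relationship type. Scoring the predicted triples against a hand-labelled gold set with `f_measure` mirrors the kind of evaluation the abstract reports.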
