Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (1): 64-68. DOI: 10.11772/j.issn.1001-9081.2014.01.0064

• Advanced computing •

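Weakly supervised method for attribute relation extraction

YANG Yufei, DAI Qi, JIA Zhen, YIN Hongfeng

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 610031, China
  • Received: 2013-07-29 Revised: 2013-09-12 Online: 2014-01-01 Published: 2014-02-14
  • Contact: JIA Zhen
  • About the authors: YANG Yufei (born 1988), male, from Zhumadian, Henan, master's student; main research interest: information extraction. DAI Qi (born 1963), male, from Chengdu, Sichuan, associate professor; main research interests: data mining, intelligent information processing. JIA Zhen (born 1975), female, from Kaifeng, Henan, lecturer, M.S.; main research interests: information extraction, content security. YIN Hongfeng (born 1964), male, from Shangqiu, Henan, professor, Ph.D.; main research interests: big data, semantic search.
  • Supported by:

    National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities; Open Project of the Key Laboratory of Complex Systems Management and Control, Institute of Automation, Chinese Academy of Sciences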

Abstract: To address the shortage of training corpora for extracting attribute relations from Chinese encyclopedias, a weakly supervised extraction method requiring minimal human intervention is proposed. First, the semi-structured attribute relations in the infoboxes of Chinese encyclopedia entries are labeled back onto the entry texts, which yields a training corpus automatically. Second, the training corpus is refined using naive Bayes classification. Finally, an attribute relation extraction model is built with a Conditional Random Field (CRF). In experiments on a dataset collected from the Hudong encyclopedia, the method reached an F-score of 80.9%. The results indicate that the method obtains a high-quality training corpus and achieves good extraction performance.

Key words: relation extraction, weak supervision, Chinese encyclopedia, naive Bayes classification, Conditional Random Field (CRF)
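
The first step of the abstract, labeling infobox attribute values back onto the entry text, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `back_label` and `build_corpus`, the character-level BIO tag scheme, the exact-substring matching rule, and the sample infobox are all assumptions made for illustration. The naive Bayes refinement and the CRF training steps are not shown.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (character, BIO label) pairs


def back_label(sentence: str, infobox: Dict[str, str]) -> Tagged:
    """Tag each character of `sentence` with a BIO label taken from infobox values.

    Assumption: a value counts as a mention of its attribute whenever it
    appears verbatim in the sentence (the usual weak-supervision heuristic).
    """
    labels = ["O"] * len(sentence)
    # Try longer values first so a short value cannot claim part of a longer match.
    for attr, value in sorted(infobox.items(), key=lambda kv: -len(kv[1])):
        if not value:
            continue  # an empty value would match everywhere
        start = sentence.find(value)
        while start != -1:
            end = start + len(value)
            # Only label spans that no earlier value has already claimed.
            if all(lab == "O" for lab in labels[start:end]):
                labels[start] = f"B-{attr}"
                for i in range(start + 1, end):
                    labels[i] = f"I-{attr}"
            start = sentence.find(value, end)
    return list(zip(sentence, labels))


def build_corpus(sentences: List[str], infobox: Dict[str, str]) -> List[Tagged]:
    """Keep only sentences in which at least one infobox value was found."""
    corpus = []
    for sentence in sentences:
        tagged = back_label(sentence, infobox)
        if any(label != "O" for _, label in tagged):
            corpus.append(tagged)
    return corpus


if __name__ == "__main__":
    # Hypothetical entry: attribute names and values are made up for the demo.
    infobox = {"birth_place": "Chengdu", "occupation": "engineer"}
    sentences = ["Born in Chengdu.", "He likes tea."]
    for tagged in build_corpus(sentences, infobox):
        print(tagged)
```

Tagging per character rather than per word is a convenience for Chinese text, which has no whitespace word boundaries; a real pipeline would likely segment the text first. Exact-substring matching also produces noisy labels, which is the kind of error a later filtering step such as the naive Bayes refinement is meant to reduce.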

CLC Number: