Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (2): 455-459. DOI: 10.11772/j.issn.1001-9081.2016.02.0455

• 3rd CCF Big Data Academic Conference (CCF BigData 2015)

Personal title and career attributes extraction based on distant supervision and pattern matching

YU Dong1,2, LIU Chunhua2, TIAN Yue2

  1. Institute of Big Data and Language Education, Beijing Language and Culture University, Beijing 100083, China;
  2. School of Information Science, Beijing Language and Culture University, Beijing 100083, China
  • Received: 2015-09-15 Revised: 2015-09-22 Online: 2016-02-10 Published: 2016-02-03
  • Corresponding author: YU Dong (born 1982), male, from Rizhao, Shandong; assistant researcher, Ph.D.; main research interests: computational linguistics, semantic analysis, artificial intelligence.
  • About the authors: LIU Chunhua (born 1994), female, from Chongqing; main research interest: natural language processing. TIAN Yue (born 1994), female, from Jiaozuo, Henan; main research interest: natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61300081); Fundamental Research Funds for the Central Universities (Beijing Language and Culture University research project: 15YJ030006).

Abstract: To address the problem of extracting the title and career attributes of a specified person from unstructured text, an attribute extraction method based on distant supervision and pattern matching was proposed. The method describes title and career features at two levels, string patterns and dependency patterns, and splits the task into two stages. First, distant supervision knowledge and human-annotated knowledge were used to mine a high-coverage pattern base, which was used to discover title and career attributes and to extract a candidate set. Second, filtering rules were designed from the textual adjacency among attributes such as titles and organizations and from the dependency relations between the specified person and the candidate attributes; these rules screened the candidates to achieve high-precision extraction. Experiments on the CLP2014-PAE test set show that the F-score of the proposed method reaches 55.37%, significantly higher than both the best result in the evaluation (F-score 34.38%) and a supervised sequence tagging method based on Conditional Random Field (CRF) (F-score 43.79%). The results indicate that the proposed method can mine and extract title and career attributes from unstructured documents with high coverage.

Key words: personal attribute extraction, title and career information, distant supervision, pattern matching, rule filtering
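
The two-stage idea in the abstract (high-coverage candidate discovery followed by rule-based filtering) can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern list `TITLE_WORDS`, the function names, the English toy sentence, and the character-window rule are all assumptions made for illustration. In particular, the window rule is only a crude stand-in for the dependency-based filter described in the paper, and the paper's pattern base is mined from distant-supervision and annotated data rather than written by hand.

```python
import re
from dataclasses import dataclass

# Hypothetical toy pattern base (assumption): a hand-written word list
# standing in for the mined string patterns described in the abstract.
TITLE_WORDS = ["professor", "director", "chairman", "engineer"]
TITLE_RE = re.compile(r"\b(" + "|".join(TITLE_WORDS) + r")\b", re.IGNORECASE)


@dataclass(frozen=True)
class Candidate:
    text: str
    start: int
    end: int


def find_candidates(sentence):
    """Stage 1: high-recall candidate discovery by string-pattern matching."""
    return [
        Candidate(m.group(1), m.start(1), m.end(1))
        for m in TITLE_RE.finditer(sentence)
    ]


def filter_candidates(sentence, person, candidates, window=15):
    """Stage 2: keep candidates that start within `window` characters
    after a mention of `person`.

    This proximity rule is an assumed placeholder for the paper's
    dependency-relation filter, which would need a syntactic parser.
    """
    mention_ends = [m.end() for m in re.finditer(re.escape(person), sentence)]
    kept = []
    for cand in candidates:
        if any(0 <= cand.start - end <= window for end in mention_ends):
            kept.append(cand)
    return kept


def extract_titles(sentence, person):
    """Run both stages and return the surviving title strings."""
    candidates = find_candidates(sentence)
    return [c.text for c in filter_candidates(sentence, person, candidates)]


if __name__ == "__main__":
    text = "Zhang Wei is a professor, and Li Na is a director."
    print(extract_titles(text, "Zhang Wei"))
    print(extract_titles(text, "Li Na"))
```

Stage 1 alone returns both titles for either person; the filter is what ties each title to the person it belongs to, which mirrors the abstract's split between coverage in the first stage and precision in the second.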

CLC Number: