Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (4): 1013-1016. DOI: 10.11772/j.issn.1001-9081.2015.04.1013

Family relation extraction from Wikipedia by self-supervised learning

ZHU Suyang1,2, HUI Haotian1,2, QIAN Longhua1,2, ZHANG Min1,2   

  1. Natural Language Processing Laboratory, Soochow University, Suzhou Jiangsu 215006, China;
  2. School of Computer Science and Technology, Soochow University, Suzhou Jiangsu 215006, China
  • Received: 2014-10-27 Revised: 2015-01-05 Online: 2015-04-10 Published: 2015-04-08

基于自监督学习的维基百科家庭关系抽取

朱苏阳1,2, 惠浩添1,2, 钱龙华1,2, 张民1,2   

  1. 苏州大学 自然语言处理实验室, 江苏 苏州 215006;
  2. 苏州大学 计算机科学与技术学院, 江苏 苏州 215006
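  • Corresponding author: ZHU Suyang
  • About the authors: ZHU Suyang (born 1989), male, from Suzhou, Jiangsu, master's student, main research interest: information extraction; HUI Haotian (born 1991), male, from Xuzhou, Jiangsu, master's student, main research interest: information extraction; QIAN Longhua (born 1966), male, from Suzhou, Jiangsu, associate professor, CCF member, main research interest: natural language processing; ZHANG Min (born 1970), male, from Harbin, Heilongjiang, professor, doctoral supervisor, CCF member, main research interest: machine translation.
  • Supported by:

    National Natural Science Foundation of China (61373096, 90920004); Major Project of Natural Science Research of Jiangsu Higher Education Institutions (11KJA520003).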

Abstract:

Traditional supervised relation extraction requires a large amount of manually annotated training data, while semi-supervised learning suffers from low recall. To address both problems, a self-supervised learning approach was proposed to extract family relations between persons. First, semi-structured information (family relation triples) was mapped to the free text of Chinese Wikipedia to automatically generate annotated training data. Then, family relations between person entities were extracted from Wikipedia text with a feature-based relation extraction method. Experimental results on a manually annotated test set of family networks show that this method outperforms bootstrapping and reaches an F1-measure of 77%, indicating that self-supervised learning can extract family relations between persons effectively.
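
The first step, mapping relation triples onto free text to obtain labeled training examples, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe their matching procedure, feature set, or classifier. The function name `label_sentences`, the `NONE` label, the plain substring matching of names, and the fictional persons and sentences are all assumptions made for illustration.

```python
from itertools import permutations


def label_sentences(sentences, triples, persons):
    """Auto-label person pairs that co-occur in a sentence (illustrative only).

    sentences: list of free-text sentences
    triples:   set of (head, relation, tail) taken from semi-structured data
    persons:   list of known person names (matched by naive substring search)
    """
    # Look up the relation for an ordered (head, tail) pair.
    relation_of = {(head, tail): rel for head, rel, tail in triples}
    examples = []
    for sentence in sentences:
        # Assumption: a person is "mentioned" if the name appears verbatim.
        mentioned = [p for p in persons if p in sentence]
        for head, tail in permutations(mentioned, 2):
            # Pairs without a matching triple become negative examples.
            label = relation_of.get((head, tail), "NONE")
            examples.append((sentence, head, tail, label))
    return examples


# Hypothetical data: fictional persons, not taken from Wikipedia.
triples = {("Alice Doe", "spouse", "Bob Doe")}
persons = ["Alice Doe", "Bob Doe", "Carol Roe"]
sentences = [
    "Alice Doe married Bob Doe in 1990.",
    "Alice Doe met Carol Roe at university.",
]
examples = label_sentences(sentences, triples, persons)
for example in examples:
    print(example)
```

The labeled tuples produced this way could then be converted to feature vectors and passed to any supervised classifier, which is the role the feature-based extraction step plays in the abstract.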

Key words: self-supervised learning, Wikipedia, semi-structured information, relation extraction

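Abstract (translated from Chinese):

Traditional supervised relation extraction methods require large amounts of manually annotated training corpora, while semi-supervised methods have low recall. To address this, a method based on self-supervised learning was proposed to extract family relations between persons. The method first maps the semi-structured information of Chinese Wikipedia, namely family relation triples, onto free text to automatically generate annotated training corpora; it then uses a feature-based relation extraction method to obtain family relations between persons from the text of Chinese Wikipedia. Experimental results on a manually annotated family relation network test set show that the method outperforms bootstrapping, with an F1 score of 77%, indicating that self-supervised learning can extract family relations between persons fairly effectively.

Key words (translated from Chinese): self-supervised learning, Wikipedia, semi-structured information, relation extraction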

CLC Number: