Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (10): 2990-2995. DOI: 10.11772/j.issn.1001-9081.2021081521

• Artificial Intelligence •

Chinese event detection based on data augmentation and weakly supervised adversarial training

Ping LUO¹, Ling DING¹, Xue YANG², Yang XIANG¹

  1. College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
  2. iSoftStone Information Technology (Group) Company Limited, Langfang, Hebei 065000, China
  • Received: 2021-08-26 Revised: 2021-12-03 Accepted: 2021-12-06 Online: 2022-01-07 Published: 2022-10-10
  • Contact: Yang XIANG
  • About the authors: LUO Ping (first contact), born in 1997, a native of Huangshan, Anhui, M.S. candidate. Her research interests include natural language processing, information extraction, and event extraction.
    DING Ling, born in 1995, a native of Zibo, Shandong, Ph.D. candidate, CCF member. Her research interests include natural language processing, information extraction, and event extraction.
    YANG Xue, born in 1985, a native of Langfang, Hebei. Her research interests include enterprise digitalization and smart cities.
    XIANG Yang, born in 1962, a native of Shanghai, Ph.D., professor, CCF member. His research interests include machine learning, data mining, and natural language processing. tjdxxiangyang@gmail.com
  • Supported by:
    National Natural Science Foundation of China (72071145)

Abstract:

Existing event detection models rely heavily on human-annotated data. When labeled data is limited, fully supervised deep learning models for event detection often overfit, while weakly supervised methods that replace time-consuming human annotation with automatically labeled data typically depend on complex predefined rules. To address these issues, a BERT (Bidirectional Encoder Representations from Transformers) based Mix-text ADversarial training (BMAD) method for Chinese event detection was proposed. The method sets up a weakly supervised learning scenario on the basis of data augmentation and adversarial learning, and uses a span extraction model to perform event detection. First, to relieve the problem of insufficient data, data augmentation methods such as back-translation and Mix-Text were applied to augment the data and create a weakly supervised learning scenario for event detection. Then, an adversarial training mechanism was applied to learn from noise, aiming to generate samples that approximate real samples as closely as possible and ultimately to improve the robustness of the whole model. Experiments were conducted on the widely used real-world dataset Automatic Content Extraction (ACE) 2005. The results show that, compared with algorithms such as Nugget Proposal Network (NPN), Trigger-aware Lattice Neural Network (TLNN) and Hybrid-Character-Based Neural Network (HCBNN), the proposed method improves the F1 score by at least 0.84 percentage points.

Key words: information extraction, Chinese event detection, data augmentation, weakly supervised learning, adversarial training
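
The abstract names Mix-Text as one of the augmentation methods but gives no implementation details. As an illustration only, the following sketch assumes that Mix-Text refers to mixup-style interpolation, in which two hidden representations and their labels are blended with the same weight. The function names (`sample_lambda`, `interpolate`, `mix_example`), the toy vectors, and the Beta parameter are hypothetical and are not taken from the paper.

```python
import random


def sample_lambda(alpha: float, rng: random.Random) -> float:
    """Draw a mixing weight from Beta(alpha, alpha), folded into [0.5, 1]."""
    lam = rng.betavariate(alpha, alpha)
    # Folding keeps the mixed sample closer to the first input.
    return max(lam, 1.0 - lam)


def interpolate(a, b, lam):
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mix_example(hidden_a, label_a, hidden_b, label_b, lam):
    """Mix two (hidden vector, soft label) pairs with the same weight."""
    mixed_hidden = interpolate(hidden_a, hidden_b, lam)
    mixed_label = interpolate(label_a, label_b, lam)
    return mixed_hidden, mixed_label


# Toy inputs (assumed): 3-dimensional "hidden states" and
# one-hot labels over 2 event types.
mixed_hidden, mixed_label = mix_example(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    lam=0.75,
)

# In training, the weight would be sampled rather than fixed.
sampled_lam = sample_lambda(0.4, random.Random(0))
```

Because labels are mixed with the same weight as the representations, the mixed label remains a valid probability distribution. A model trained on such pairs sees soft targets between the original examples, which is one way augmented data can serve as a weak supervision signal.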

CLC Number: