Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (8): 2406-2411.DOI: 10.11772/j.issn.1001-9081.2022071124

• Artificial intelligence •

Cross-lingual zero-resource named entity recognition model based on sentence-level generative adversarial network

Xiaoyan ZHANG, Zhengyu DUAN   

  1. College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, Shaanxi 710600, China
  • Received: 2022-08-01 Revised: 2022-11-04 Accepted: 2022-11-11 Online: 2023-01-15 Published: 2023-08-10
  • Contact: Xiaoyan ZHANG
  • About author: DUAN Zhengyu, born in 1998 in Anqing, Anhui, M. S. candidate. His research interests include deep learning and natural language processing.

Abstract:

To address the problem that low-resource languages lack labeled data, which prevents the application of existing mature deep learning methods to Named Entity Recognition (NER), a cross-lingual NER model based on a sentence-level Generative Adversarial Network (GAN), namely SLGAN-XLM-R (Sentence Level GAN based on XLM-R), was proposed. Firstly, the labeled data of the source language was used to train the NER model on the basis of the pre-trained model XLM-R (XLM-Robustly optimized BERT pretraining approach); at the same time, linguistic adversarial training was performed on the embedding layer of the XLM-R model using the unlabeled data of the target language. Then, the soft labels of the unlabeled target-language data were predicted by the NER model. Finally, the labeled data of the source language and the target language were mixed, and the model was fine-tuned again to obtain the final NER model. Experiments were conducted on four languages (English, German, Spanish, and Dutch) from the CoNLL2002 and CoNLL2003 datasets. The results show that with English as the source language, the F1 scores of the SLGAN-XLM-R model on the German, Spanish, and Dutch test sets are 72.70%, 79.42%, and 80.03%, respectively, which are 5.38, 5.38, and 3.05 percentage points higher than those obtained by directly fine-tuning the XLM-R model.
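
The abstract specifies the training pipeline but not its implementation. Below is a minimal PyTorch sketch of one plausible realization; none of the following details are stated in the paper: it assumes the Hugging Face transformers XLM-R encoder, a linear token classifier as the NER head, a linear two-way language discriminator over mean-pooled sentence embeddings (the "sentence-level" adversary), and a gradient-reversal layer to carry the adversarial signal back into the embedding layers. All names and hyperparameters are illustrative.

import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in
    the backward pass, so the encoder is pushed toward language-invariant
    representations while the discriminator tries to separate them."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SLGANNER(nn.Module):
    """Hypothetical SLGAN-XLM-R-style model: a shared XLM-R encoder with
    a token-level NER head and a sentence-level language discriminator."""
    def __init__(self, num_tags, lam=0.1):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_tags)    # per-token tag logits
        self.lang_head = nn.Linear(hidden, 2)          # source vs. target language
        self.lam = lam

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        tag_logits = self.ner_head(h)                  # (batch, seq_len, num_tags)
        # Sentence-level adversarial signal: mean-pool the token
        # representations, reverse gradients, classify the language.
        mask = attention_mask.unsqueeze(-1).float()
        sent = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        lang_logits = self.lang_head(GradientReversal.apply(sent, self.lam))
        return tag_logits, lang_logits

For the second stage, the abstract says the trained tagger predicts soft labels on unlabeled target-language text, and the model is then fine-tuned on the mixture of source gold labels and target soft labels. A sketch of the soft-label step, with the same caveats:

# model, target_ids, and target_mask are placeholders for the trained
# tagger and a batch of tokenized unlabeled target-language sentences.
model.eval()
with torch.no_grad():
    tag_logits, _ = model(target_ids, target_mask)
    soft_labels = tag_logits.softmax(dim=-1)  # per-token tag distributions
# Second fine-tuning pass: cross-entropy on source gold tags plus a
# soft-target loss (e.g. KL divergence) against these distributions.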

Key words: cross-lingual, Named Entity Recognition (NER), XLM-R (XLM-Robustly optimized BERT pretraining approach), linguistic adversarial training, pre-trained model
