Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (8): 2406-2411.DOI: 10.11772/j.issn.1001-9081.2022071124

• Artificial intelligence •

Cross-lingual zero-resource named entity recognition model based on sentence-level generative adversarial network

Xiaoyan ZHANG, Zhengyu DUAN   

  1. College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, Shaanxi 710600, China
  • Received: 2022-08-01 Revised: 2022-11-04 Accepted: 2022-11-11 Online: 2023-01-15 Published: 2023-08-10
  • Contact: Xiaoyan ZHANG
  • About author: DUAN Zhengyu, born in 1998 in Anqing, Anhui, M. S. candidate. His research interests include deep learning and natural language processing.

Abstract:

To address the problem that low-resource languages lack labeled data, which prevents the application of existing mature deep learning methods to Named Entity Recognition (NER), a cross-lingual NER model based on a sentence-level Generative Adversarial Network (GAN), namely SLGAN-XLM-R (Sentence Level GAN based on XLM-R), was proposed. Firstly, the labeled data of the source language was used to train the NER model on the basis of the pre-trained model XLM-R (XLM-Robustly optimized BERT pretraining approach); at the same time, linguistic adversarial training was performed on the embedding layer of the XLM-R model using the unlabeled data of the target language. Then, the soft labels of the unlabeled target-language data were predicted by the NER model. Finally, the labeled data of the source language and the target language were mixed, and the model was fine-tuned again to obtain the final NER model. Experiments were conducted on four languages (English, German, Spanish, and Dutch) from the CoNLL2002 and CoNLL2003 datasets. The results show that with English as the source language, the F1 scores of the SLGAN-XLM-R model on the German, Spanish, and Dutch test sets are 72.70%, 79.42%, and 80.03%, respectively, which are 5.38, 5.38, and 3.05 percentage points higher than those obtained by directly fine-tuning the XLM-R model.
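
The abstract specifies the training pipeline but not its implementation. Below is a minimal PyTorch sketch of one plausible realization; none of the following details are stated in the paper: it assumes the Hugging Face transformers XLM-R encoder, a linear token classifier as the NER head, a linear two-way language discriminator over mean-pooled sentence embeddings (the "sentence-level" adversary), and a gradient-reversal layer to carry the adversarial signal back into the embedding layers. All names and hyperparameters are illustrative.

import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in
    the backward pass, so the encoder is pushed toward language-invariant
    representations while the discriminator tries to separate them."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SLGANNER(nn.Module):
    """Hypothetical SLGAN-XLM-R-style model: a shared XLM-R encoder with
    a token-level NER head and a sentence-level language discriminator."""
    def __init__(self, num_tags, lam=0.1):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_tags)    # per-token tag logits
        self.lang_head = nn.Linear(hidden, 2)          # source vs. target language
        self.lam = lam

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        tag_logits = self.ner_head(h)                  # (batch, seq_len, num_tags)
        # Sentence-level adversarial signal: mean-pool the token
        # representations, reverse gradients, classify the language.
        mask = attention_mask.unsqueeze(-1).float()
        sent = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        lang_logits = self.lang_head(GradientReversal.apply(sent, self.lam))
        return tag_logits, lang_logits

For the second stage, the abstract says the trained tagger predicts soft labels on unlabeled target-language text, and the model is then fine-tuned on the mixture of source gold labels and target soft labels. A sketch of the soft-label step, with the same caveats:

# model, target_ids, and target_mask are placeholders for the trained
# tagger and a batch of tokenized unlabeled target-language sentences.
model.eval()
with torch.no_grad():
    tag_logits, _ = model(target_ids, target_mask)
    soft_labels = tag_logits.softmax(dim=-1)  # per-token tag distributions
# Second fine-tuning pass: cross-entropy on source gold tags plus a
# soft-target loss (e.g. KL divergence) against these distributions.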

Key words: cross-lingual, Named Entity Recognition (NER), XLM-R (XLM-Robustly optimized BERT pretraining approach), linguistic adversarial training, pre-trained model
