Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (8): 2406-2411. DOI: 10.11772/j.issn.1001-9081.2022071124
Special topic: Artificial Intelligence
Cross-lingual zero-resource named entity recognition model based on sentence-level generative adversarial network
Xiaoyan ZHANG, Zhengyu DUAN
Received: 2022-08-01
Revised: 2022-11-04
Accepted: 2022-11-11
Online: 2023-01-15
Published: 2023-08-10
Contact: Xiaoyan ZHANG
About author: DUAN Zhengyu, born in 1998, M. S. candidate. His research interests include deep learning and natural language processing.
Abstract: To address the problem that low-resource languages lack labeled data, so that existing mature deep learning methods cannot be applied to Named Entity Recognition (NER), a cross-lingual NER model based on a sentence-level Generative Adversarial Network (GAN), SLGAN-XLM-R (Sentence-Level GAN based on XLM-R), was proposed. Firstly, an NER model was trained on top of the pre-trained model XLM-R (XLM-RoBERTa, Robustly optimized BERT pretraining approach) with the labeled data of the source language; at the same time, language-adversarial training was performed on the embedding layer of XLM-R with the unlabeled data of the target language. Then, the NER model was used to predict soft labels for the unlabeled target-language data. Finally, the labeled data of the source and target languages were mixed to fine-tune the model a second time, yielding the final NER model. Experimental results on English, German, Spanish and Dutch from the CoNLL2002 and CoNLL2003 datasets show that, with English as the source language, SLGAN-XLM-R achieves F1 scores of 72.70%, 79.42% and 80.03% on the German, Spanish and Dutch test sets, which are 5.38, 5.38 and 3.05 percentage points higher than directly fine-tuning the XLM-R model.
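To make the training pipeline described in the abstract more concrete, the sketch below outlines one plausible PyTorch/Transformers implementation of the joint objective: a token-level NER head trained on source-language labels plus a sentence-level language discriminator trained adversarially on the encoder output. The class names, hyperparameters, and the gradient-reversal trick (used here as a compact stand-in for the paper's explicit GAN training loop) are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: joint NER + sentence-level language-adversarial training
# on top of XLM-R. Gradient reversal stands in for the explicit GAN loop of the paper.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SLGANNER(nn.Module):
    """XLM-R encoder with a token-level NER head and a sentence-level language discriminator."""

    def __init__(self, num_labels: int, lambd: float = 0.1):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_labels)
        self.discriminator = nn.Sequential(  # predicts source (0) vs. target (1) language
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        ner_logits = self.ner_head(h)            # (batch, seq_len, num_labels)
        sentence_repr = h[:, 0]                  # <s> token used as the sentence representation
        lang_logits = self.discriminator(GradientReversal.apply(sentence_repr, self.lambd))
        return ner_logits, lang_logits


def training_step(model, batch, optimizer):
    """One joint step: NER loss on labelled source tokens plus adversarial language loss.
    Target-language sentences carry NER labels of -100 and only contribute to the language loss."""
    ner_logits, lang_logits = model(batch["input_ids"], batch["attention_mask"])
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss = ce(ner_logits.flatten(0, 1), batch["ner_labels"].flatten()) + ce(lang_logits, batch["lang_labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The second stage described in the abstract would reuse the NER head of such a model to predict soft labels for target-language sentences and then run the same training step again on the mixed source/target data.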
CLC number:
Xiaoyan ZHANG, Zhengyu DUAN. Cross-lingual zero-resource named entity recognition model based on sentence-level generative adversarial network[J]. Journal of Computer Applications, 2023, 43(8): 2406-2411.
Tab. 1 Statistics of datasets

| Language | Type | Training set | Validation set | Test set |
| --- | --- | --- | --- | --- |
| English [en] (CoNLL2003) | Sentences | 14 987 | 3 466 | 3 684 |
|  | Entities | 23 499 | 5 942 | 5 648 |
| German [de] (CoNLL2003) | Sentences | 12 705 | 3 068 | 3 160 |
|  | Entities | 11 851 | 4 833 | 3 673 |
| Spanish [es] (CoNLL2002) | Sentences | 8 323 | 1 915 | 1 517 |
|  | Entities | 18 798 | 4 351 | 3 558 |
| Dutch [nl] (CoNLL2002) | Sentences | 15 806 | 2 895 | 5 195 |
|  | Entities | 13 344 | 2 616 | 3 941 |
Tab. 2 Named entity labeling scheme

| Entity type | Tag | Description | Example |
| --- | --- | --- | --- |
| Person name | B-PER | First token of a person name | 终(B-PER) 南(I-PER) 山(I-PER) |
|  | I-PER | Subsequent tokens of a person name |  |
| Organization or company | B-ORG | First token of an organization name | 华(B-ORG) 为(I-ORG) |
|  | I-ORG | Subsequent tokens of an organization name |  |
| Location | B-LOC | First token of a location name | 西(B-LOC) 安(I-LOC) 市(I-LOC) |
|  | I-LOC | Subsequent tokens of a location name |  |
| Other entity | B-MISC | First token of another entity | 疫(B-MISC) 苗(I-MISC) |
|  | I-MISC | Subsequent tokens of another entity |  |
| Non-entity | O | Any non-entity token | 你(O) 好(O) |
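To make the BIO convention of Tab. 2 concrete, the small helper below (an illustrative function, not taken from the paper) expands entity spans over a token sequence into B-/I-/O tags.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans (start, end, type), with `end` exclusive, into BIO tags over `tokens`."""
    tags = ["O"] * len(tokens)                  # every token outside a span is a non-entity
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"        # subsequent tokens of the entity
    return tags


# The person-name example from Tab. 2: the three characters 终 南 山 form one PER entity.
print(spans_to_bio(["终", "南", "山"], [(0, 3, "PER")]))   # ['B-PER', 'I-PER', 'I-PER']
```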
Tab. 3 Example of data pre-processing

| Category | Example |
| --- | --- |
| Original sentence | EU rejects German call to boycott British lamb |
| Original labels | B-ORG O B-MISC O O O B-MISC O O |
| Preprocessed sentence | _EU_rejects_German_call_to_boycott_British_lamb |
| Preprocessed entity labels | B-ORG O O O B-MISC O O O O O B-MISC O O O O |
| Preprocessed language label | 0 |
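Tab. 3 suggests that preprocessing runs each word through the XLM-R SentencePiece tokenizer (hence the "_" markers), expands word-level NER labels to subword level, and attaches a language label (0 = source language) for the sentence-level discriminator. The sketch below shows one common way to perform this alignment with the Hugging Face tokenizer; the rule of keeping the tag only on the first subword piece is an assumption inferred from the example, not necessarily the authors' exact scheme.

```python
# Illustrative word-to-subword label alignment; requires `transformers` and `sentencepiece`.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")


def align_labels(words, word_labels):
    """Tokenize each word into SentencePiece pieces and expand word-level NER labels."""
    pieces, piece_labels = [], []
    for word, label in zip(words, word_labels):
        sub = tokenizer.tokenize(word)             # pieces carry a leading '▁' at word starts
        pieces.extend(sub)
        piece_labels.extend([label] + ["O"] * (len(sub) - 1))  # assumed: only the first piece keeps the tag
    return pieces, piece_labels


words = "EU rejects German call to boycott British lamb".split()
labels = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O"]
subword_tokens, subword_labels = align_labels(words, labels)
print(subword_tokens)
print(subword_labels)
language_label = 0  # source language (English), consumed by the sentence-level discriminator
```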
Tab. 4 Comparison of F1 scores of different cross-lingual models (unit: %)

| Model | German | Spanish | Dutch | Average |
| --- | --- | --- | --- | --- |
| Ref. [ ] | 48.12 | 60.55 | 61.56 | 56.74 |
| Ref. [ ] | 58.50 | 65.10 | 65.40 | 63.00 |
| Ref. [ ] | 57.23 | 64.10 | 63.37 | 61.57 |
| Ref. [ ] | 61.50 | 73.50 | 69.90 | 68.30 |
| Ref. [ ] | 65.24 | 75.93 | 74.61 | 71.93 |
| Ref. [ ] | 69.56 | 74.96 | 77.57 | 73.57 |
| Ref. [ ] | 71.90 | 74.30 | 77.60 | 74.60 |
| Proposed model (SLGAN+NER) | 69.51 | 78.32 | 78.71 | 75.51 |
| Proposed model (SLGAN-XLM-R) | 72.70 | 79.42 | 80.03 | 76.00 |
Tab. 5 Comparison of F1 scores of different pre-trained language models (unit: %)

| Model | Training method | German | Spanish | Dutch |
| --- | --- | --- | --- | --- |
| mBERT | Direct fine-tuning | 62.34 | 69.70 | 68.52 |
|  | Adversarial training | 63.97 | 72.43 | 71.49 |
|  | Second-stage fine-tuning | 66.30 | 74.46 | 72.96 |
| XLM-R | Direct fine-tuning | 67.32 | 74.04 | 76.98 |
|  | Adversarial training | 69.51 | 78.32 | 78.71 |
|  | Second-stage fine-tuning | 72.70 | 79.42 | 80.03 |
[1] BANERJEE P S, CHAKRABORTY B, TRIPATHI D, et al. A information retrieval based on question and answering and NER for unstructured information without using SQL[J]. Wireless Personal Communications, 2019, 108(3): 1909-1931. DOI: 10.1007/s11277-019-06501-z.
[2] FABBRI A, NG P, WANG Z G, et al. Template-based question generation from retrieved sentences for improved unsupervised question answering[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4508-4513. DOI: 10.18653/v1/2020.acl-main.413.
[3] NALLAPATI R, ZHOU B W, DOS SANTOS C, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond[C]// Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Stroudsburg, PA: ACL, 2016: 280-290. DOI: 10.18653/v1/k16-1028.
[4] KRUENGKRAI C, NGUYEN T H, ALJUNIED S M, et al. Improving low-resource named entity recognition using joint sentence and token labeling[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 5898-5905. DOI: 10.18653/v1/2020.acl-main.523.
[5] SUN L F, YI L H, CHEN H X, et al. Back attention knowledge transfer for low-resource named entity recognition[EB/OL]. (2021-06-18) [2022-09-20]. DOI: 10.5121/csit.2022.120625.
[6] LIU L L, DING B S, BING L D, et al. MulDA: a multilingual data augmentation framework for low-resource cross-lingual NER[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2021: 5834-5846. DOI: 10.18653/v1/2021.acl-long.453.
[7] JAIN A, PARANJAPE B, LIPTON Z C. Entity projection via machine translation for cross-lingual NER[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 1083-1092. DOI: 10.18653/v1/d19-1100.
[8] DING B S, LIU L L, BING L D, et al. DAGA: data augmentation with a generation approach for low-resource tagging tasks[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 6045-6057. DOI: 10.18653/v1/2020.emnlp-main.488.
[9] BARI M S, JOTY S R, JWALAPURAM P. Zero-resource cross-lingual named entity recognition[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 7415-7423. DOI: 10.1609/aaai.v34i05.6237.
[10] KEUNG P, LU Y C, BHARDWAJ V. Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 1355-1360. DOI: 10.18653/v1/d19-1138.
[11] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171-4186. DOI: 10.18653/v1/n18-2.
[12] WU Q H, LIN Z J, WANG G X, et al. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 9274-9281. DOI: 10.1609/aaai.v34i05.6466.
[13] PFEIFFER J, VULIĆ I, GUREVYCH I, et al. MAD-X: an adapter-based framework for multi-task cross-lingual transfer[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 7654-7673. DOI: 10.18653/v1/2020.emnlp-main.617.
[14] WU Q H, LIN Z J, KARLSSON B F, et al. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 6505-6514. DOI: 10.18653/v1/2020.acl-main.581.
[15] WU Q H, LIN Z J, KARLSSON B F, et al. UniTrans: unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data[C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2020: 3926-3932. DOI: 10.24963/ijcai.2020/543.
[16] YI H X, CHENG J. Zero-shot entity recognition via multi-source projection and unlabeled data[J]. IOP Conference Series: Earth and Environmental Science, 2021, 693: No.012084. DOI: 10.1088/1755-1315/693/1/012084.
[17] FU Y W, LIN N K, YANG Z Y, et al. A dual-contrastive framework for low-resource cross-lingual named entity recognition[EB/OL]. (2022-04-02) [2022-09-20]. DOI: 10.18653/v1/2022.findings-emnlp.132.
[18] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised cross-lingual representation learning at scale[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 8440-8451. DOI: 10.18653/v1/2020.acl-main.747.
[19] TJONG KIM SANG E. Introduction to the CoNLL-2002 shared task: language-independent named entity recognition[C/OL]// Proceedings of the 6th Conference on Natural Language Learning, 2002 [2022-09-20]. DOI: 10.3115/1118853.1118877.
[20] TJONG KIM SANG E, DE MEULDER F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition[C/OL]// Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, 2003 [2022-09-20]. DOI: 10.3115/1119176.1119195.
[21] CONNEAU A, LAMPLE G. Cross-lingual language model pretraining[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2019: 7059-7069. DOI: 10.18653/v1/d18-1269.
[22] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26) [2022-09-20].
[23] WANG Q, LI M X, WU S X, et al. Neural machine translation based on XLM-R cross-lingual pre-training language model[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, 58(1): 29-36.
[24] CHEN W L, JIANG H Q, WU Q H, et al. AdvPicker: effectively leveraging unlabeled data via adversarial discriminator for cross-lingual NER[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2021: 743-753. DOI: 10.18653/v1/2021.acl-long.61.
[25] WU Y H, SCHUSTER M, CHEN Z F, et al. Google's neural machine translation system: bridging the gap between human and machine translation[EB/OL]. (2016-10-08) [2022-09-20].
[26] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. (2019-01-04) [2022-09-20].
[27] TSAI C T, MAYHEW S, ROTH D. Cross-lingual named entity recognition via wikification[C]// Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Stroudsburg, PA: ACL, 2016: 219-228. DOI: 10.18653/v1/k16-1022.
[28] NI J, DINU G, FLORIAN R. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2017: 1470-1480. DOI: 10.18653/v1/p17-1135.
[29] MAYHEW S, TSAI C T, ROTH D. Cheap translation for cross-lingual named entity recognition[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2017: 2536-2545. DOI: 10.18653/v1/d17-1269.
[30] WU S J, DREDZE M. Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 833-844. DOI: 10.18653/v1/d19-1077.