Journal of Computer Applications, 2025, Vol. 45, Issue (9): 2790-2797. DOI: 10.11772/j.issn.1001-9081.2024081143
• Artificial intelligence •
Li LI, Han SONG, Peihe LIU, Hanlin CHEN
Received: 2024-08-14
Revised: 2024-10-16
Accepted: 2024-10-22
Online: 2024-11-07
Published: 2025-09-10
Contact: Li LI
About author: SONG Han, born in 2000, from Heze, Shandong, M. S. candidate. His research interests include artificial intelligence and natural language processing.
Li LI, Han SONG, Peihe LIU, Hanlin CHEN. Named entity recognition for sensitive information based on data augmentation and residual networks[J]. Journal of Computer Applications, 2025, 45(9): 2790-2797.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024081143
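Entity type | Code | Explanation |
---|---|---|
Nationality | CONT | The country or region to which a person or organization belongs |
Educational background | EDU | A person's level of education, academic qualifications, and study history |
Location | LOC | The name of a geographic location or geographic entity |
Person name | NAME | A specific individual, either the name of a real person or a fictional character |
Organization | ORG | Companies, government agencies, non-profit organizations, groups, schools, and other organizations or institutions |
Major | PRO | The specific field or body of professional knowledge a person studies or works in |
Ethnicity | RACE | Sensitive information relating to an individual's identity or membership of a particular group |
Job title | TITLE | A specific position or title held in the workplace or in society, closely tied to a person's identity and responsibilities |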
Tab. 1 Named entity specific types, designations, and explanations
Entity type | Entity beginning | Entity inside | Entity end | Single-character entity |
---|---|---|---|---|
Nationality | B-CONT | I-CONT | E-CONT | S-CONT |
Educational background | B-EDU | I-EDU | E-EDU | S-EDU |
Location | B-LOC | I-LOC | E-LOC | S-LOC |
Person name | B-NAME | I-NAME | E-NAME | S-NAME |
Organization | B-ORG | I-ORG | E-ORG | S-ORG |
Major | B-PRO | I-PRO | E-PRO | S-PRO |
Ethnicity | B-RACE | I-RACE | E-RACE | S-RACE |
Job title | B-TITLE | I-TITLE | E-TITLE | S-TITLE |
Tab. 2 BIOES entity labeling rules
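As a minimal sketch of the labeling rules in Tab. 2, the following Python function converts character-level entity spans into BIOES tags. The function name `spans_to_bioes`, the span format `(start, end, type)` with an exclusive end, the `O` tag for characters outside any entity, and the sample sentence are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# Entity codes listed in Tab. 1.
ENTITY_TYPES = {"CONT", "EDU", "LOC", "NAME", "ORG", "PRO", "RACE", "TITLE"}


def spans_to_bioes(length: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, type) spans, end exclusive, into BIOES tags.

    Characters outside every span get the conventional "O" tag
    (an assumption here; Tab. 2 lists only the B/I/E/S tags).
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        if not 0 <= start < end <= length:
            raise ValueError(f"invalid span: ({start}, {end})")
        if end - start == 1:
            tags[start] = f"S-{etype}"      # single-character entity
        else:
            tags[start] = f"B-{etype}"      # first character of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"      # interior characters
            tags[end - 1] = f"E-{etype}"    # last character of the entity
    return tags


# Hypothetical sample sentence, not drawn from the SenResume dataset.
text = "李明，汉族，工程师"
spans = [(0, 2, "NAME"), (3, 5, "RACE"), (6, 9, "TITLE")]
for ch, tag in zip(text, spans_to_bioes(len(text), spans)):
    print(ch, tag)
```

Keeping the end index exclusive makes the span length simply `end - start`, which is what separates the single-character `S-` case from the `B-`/`I-`/`E-` case.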
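Entity type | Number of labels | Entity type | Number of labels |
---|---|---|---|
Nationality (CONT) | 1 361 | Organization (ORG) | 5 810 |
Educational background (EDU) | 510 | Major (PRO) | 687 |
Location (LOC) | 547 | Ethnicity (RACE) | 469 |
Person name (NAME) | 1 258 | Job title (TITLE) | 7 908 |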
Tab. 3 Numbers of labeled entities
Hyperparameter | Value |
---|---|
Dropout | 0.5 |
Epoch | 30 |
Batch_size | 64 |
LSTM hidden layer dimension | 768 |
Maximum sequence length | 512 |
Learning rate |
Tab. 4 Hyperparameter values
Number of resumes in dataset | Augmentation scheme | P/% | R/% | F1 score/% |
---|---|---|---|---|
100 | Original data | 72.77 | 67.66 | 70.26 |
100 | E-MLM | 75.21 | 70.03 | 72.54 |
100 | Same-type entity replacement | 73.37 | 68.87 | 71.06 |
100 | Entity context replacement | 74.85 | 67.60 | 71.02 |
100 | Entity context deletion | 72.41 | 68.14 | 70.20 |
200 | Original data | 79.47 | 74.76 | 77.06 |
200 | E-MLM | 82.83 | 77.21 | 80.16 |
200 | Same-type entity replacement | 80.69 | 75.90 | 78.31 |
200 | Entity context replacement | 81.33 | 74.56 | 77.87 |
200 | Entity context deletion | 79.68 | 75.09 | 77.36 |
400 | Original data | 85.53 | 80.82 | 82.99 |
400 | E-MLM | 88.92 | 84.45 | 86.62 |
400 | Same-type entity replacement | 86.71 | 81.98 | 84.48 |
400 | Entity context replacement | 87.21 | 80.66 | 83.86 |
400 | Entity context deletion | 85.62 | 81.12 | 83.53 |
Tab. 5 Effects of data enhancement under different methods
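Tab. 5 lists same-type entity replacement as one of the compared augmentation schemes, but this page does not describe its procedure. The sketch below shows only the general idea, under the assumption that each labeled mention is swapped for another mention of the same entity type drawn from a pool. The function name `replace_same_type`, the pool contents, and the sample sentence are hypothetical, and the paper's actual implementation may differ.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)


def replace_same_type(
    text: str,
    spans: List[Span],
    pool: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, List[Span]]:
    """Swap each entity mention for a pool mention of the same type.

    Assumes the spans do not overlap. Returns the new text together
    with spans re-aligned to it, so the labels stay valid.
    """
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0      # position in the original text
    new_len = 0     # length of the text rebuilt so far
    for start, end, etype in sorted(spans):
        context = text[cursor:start]            # untouched text before the entity
        pieces.append(context)
        new_len += len(context)
        original = text[start:end]
        candidates = [m for m in pool.get(etype, []) if m != original]
        mention = rng.choice(candidates) if candidates else original
        pieces.append(mention)
        new_spans.append((new_len, new_len + len(mention), etype))
        new_len += len(mention)
        cursor = end
    pieces.append(text[cursor:])                # trailing context
    return "".join(pieces), new_spans


# Hypothetical pool and sentence, not drawn from the paper's data.
pool = {"NAME": ["王芳"], "TITLE": ["教授"]}
text = "李明，汉族，工程师"
spans = [(0, 2, "NAME"), (6, 9, "TITLE")]
new_text, new_spans = replace_same_type(text, spans, pool, random.Random(0))
print(new_text, new_spans)
```

Because a replacement can differ in length from the original mention, the spans are recomputed while the text is rebuilt instead of being copied over.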
Model | P/% | R/% | F1 score/% |
---|---|---|---|
BiLSTM-CRF | 89.06 | 91.53 | 90.28 |
BERT-BiLSTM-CRF | 93.72 | 93.38 | 93.54 |
ALBERT-BiLSTM-CRF | 94.32 | 94.27 | 93.91 |
LEBERT-BiLSTM-CRF | 94.54 | 94.33 | 94.43 |
RoBERTa-WWM-BiLSTM-CRF | 95.53 | 94.65 | 95.09 |
RoBERTa-ResBiLSTM-CRF | 95.97 | 96.37 | 96.16 |
Tab. 6 Comparison of effects of different NER methods
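The P, R, and F1 columns report precision, recall, and F1 score. As a minimal sketch, assuming entity-level exact matching (a predicted entity counts as correct only if both its boundaries and its type match a gold entity; this page does not state the matching criterion or how scores are averaged across entity types), the three values can be computed as follows. The function name `prf1` and the sample spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return precision, recall, and F1 in percent under exact span matching."""
    tp = len(gold & pred)                        # entities predicted exactly right
    p = tp / len(pred) if pred else 0.0          # share of predictions that are correct
    r = tp / len(gold) if gold else 0.0          # share of gold entities that were found
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return 100 * p, 100 * r, 100 * f1


# Hypothetical example: the TITLE entity is split in two by the prediction,
# so neither predicted fragment matches the gold span.
gold = {(0, 2, "NAME"), (3, 5, "RACE"), (6, 9, "TITLE")}
pred = {(0, 2, "NAME"), (3, 5, "RACE"), (6, 8, "TITLE"), (8, 9, "TITLE")}
p, r, f1 = prf1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under exact matching a boundary error is penalized twice: the wrong prediction lowers precision and the missed gold entity lowers recall.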
Dataset | Data version | P/% | R/% | F1 score/% |
---|---|---|---|---|
SenResume | Original dataset | 95.97 | 96.37 | 96.16 |
SenResume | 1× expanded dataset | 97.48 | 98.23 | 97.84 |
 | Original dataset | 72.75 | 73.37 | 73.06 |
 | 1× expanded dataset | 80.42 | 80.11 | 80.26 |
MSRA | Original dataset | 91.27 | 90.89 | 91.08 |
MSRA | 1× expanded dataset | 93.55 | 93.20 | 93.37 |
CLUENER2020 | Original dataset | 82.34 | 82.77 | 82.55 |
CLUENER2020 | 1× expanded dataset | 85.73 | 86.02 | 85.87 |
Tab. 7 Comparison of overall recognition effects of proposed model on different datasets
Experiment No. | Training set | Test set | P/% | R/% | F1 score/% |
---|---|---|---|---|---|
1 | SenResume | SenResume | 95.97 | 96.37 | 96.16 |
2 | People's Daily | People's Daily | 96.04 | 95.30 | 95.67 |
3 | People's Daily | SenResume | 82.52 | 83.20 | 82.96 |
4 | SenResume | People's Daily | 92.15 | 90.75 | 91.44 |
5 | 1× expanded People's Daily | SenResume | 85.25 | 85.87 | 84.02 |
6 | 1× expanded SenResume | People's Daily | 94.15 | 93.75 | 91.44 |
Tab. 8 Cross-domain and same-domain training and test results for datasets