Named entity recognition for sensitive information based on data augmentation and residual networks

Journal of Computer Applications, 2025, Vol. 45, Issue 9: 2790-2797. DOI: 10.11772/j.issn.1001-9081.2024081143

Li LI, Han SONG, Peihe LIU, Hanlin CHEN

Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China

Received: 2024-08-14; Revised: 2024-10-16; Accepted: 2024-10-22; Online: 2024-11-07; Published: 2025-09-10
Contact: Li LI
About the authors: SONG Han, born in 2000 in Heze, Shandong, M.S. candidate. His research interests include artificial intelligence and natural language processing.
LIU Peihe, born in 1972 in Hegang, Heilongjiang, engineer. His research interests include network and communication security and blockchain security.
CHEN Hanlin, born in 1976 in Guangshui, Hubei, M.S., associate professor. His research interests include information security and system integration.
Supported by:
Fundamental Research Funds for the Central Universities (3282023017, 3282024006, 3282023054); Project for Research and Practice on Innovative Talent Training Modes of Multidisciplinary Electronic Information Engineering (jy202202)

Abstract

Named Entity Recognition (NER) for sensitive information is a key technology for privacy protection. However, existing NER methods struggle in the sensitive information domain because relevant datasets are scarce, and traditional techniques suffer from low accuracy and poor portability. To address these issues, firstly, a sensitive information NER dataset, SenResume, was constructed by crawling text corpora containing sensitive information from the Internet and annotating them manually. Secondly, a data augmentation model, Entity-based Masked Language Modeling (E-MLM), was proposed, which uses whole-word masking to generate new data samples and expand the dataset, thereby enhancing data diversity. Thirdly, a RoBERTa-ResBiLSTM-CRF model was introduced, which uses the Robustly optimized Bidirectional Encoder Representations from Transformers approach with Whole Word Masking (RoBERTa-WWM) to extract contextual features and generate high-quality word vector representations, and employs a Residual Bidirectional Long Short-Term Memory (ResBiLSTM) network to enhance text features. Finally, a multi-layer residual network was applied to improve training efficiency and model stability, and a Conditional Random Field (CRF) was used for global decoding to improve the accuracy of sequence labeling. Experimental results demonstrate that E-MLM improves dataset quality significantly, and that the proposed NER model achieves the best performance on both the original and 1x augmented datasets, with F1 scores of 96.16% and 97.84%, respectively. The introduction of E-MLM and residual networks therefore contributes to improving the accuracy of sensitive information NER.

Key words: sensitive information, dataset construction, data augmentation, Bidirectional Encoder Representations from Transformers (BERT), Named Entity Recognition (NER)

CLC Number: TP391

Li LI, Han SONG, Peihe LIU, Hanlin CHEN. Named entity recognition for sensitive information based on data augmentation and residual networks[J]. Journal of Computer Applications, 2025, 45(9): 2790-2797.

Figures/Tables 13

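Tab. 1 Named entity types, codes, and explanations

Entity type	Code	Explanation
Nationality	CONT	The country or national region to which a person or organization belongs
Educational background	EDU	Information such as a person's level of education, degree, and study history
Place name	LOC	A specific type of entity denoting the name of a geographic location or geographic entity
Person name	NAME	Refers to a specific individual; may be the name of a real person or of a fictional character
Organization name	ORG	Organizations or institutions such as companies, government agencies, non-profit organizations, groups, and schools
Major	PRO	Describes the specific field or specialized knowledge that a person studies or works in
Ethnicity	RACE	One kind of sensitive information concerning individual identity or the identity of a specific group
Job title	TITLE	A specific post or title held in the workplace or in society, closely tied to a person's identity and responsibilities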

Tab. 2 BIOES entity labeling rules

Entity type	Beginning of entity	Inside of entity	End of entity	Single-token entity
Nationality	B-CONT	I-CONT	E-CONT	S-CONT
Educational background	B-EDU	I-EDU	E-EDU	S-EDU
Place name	B-LOC	I-LOC	E-LOC	S-LOC
Person name	B-NAME	I-NAME	E-NAME	S-NAME
Organization name	B-ORG	I-ORG	E-ORG	S-ORG
Major	B-PRO	I-PRO	E-PRO	S-PRO
Ethnicity	B-RACE	I-RACE	E-RACE	S-RACE
Job title	B-TITLE	I-TITLE	E-TITLE	S-TITLE
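
The BIOES rules in Tab. 2 can be illustrated with a short sketch that turns entity spans into per-token labels. The function name `bioes_tags`, the span format (start, exclusive end, type code), and the example tokens are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple


def bioes_tags(length: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label `length` tokens from (start, end_exclusive, type) entity spans."""
    tags = ["O"] * length  # tokens outside any entity keep the "O" label
    for start, end, etype in spans:
        if not 0 <= start < end <= length:
            raise ValueError(f"span ({start}, {end}) is out of range")
        if end - start == 1:
            # A one-token entity gets the single-entity label.
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"          # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # interior tokens
            tags[end - 1] = f"E-{etype}"        # last token of the entity
    return tags


# Hypothetical example: a person name, an ethnicity, and an organization.
tokens = ["Li", "Ming", ",", "Han", ",", "Beijing", "Normal", "University"]
spans = [(0, 2, "NAME"), (3, 4, "RACE"), (5, 8, "ORG")]
tags = bioes_tags(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

For Chinese text the same scheme is usually applied per character rather than per word, so each element of `tokens` would be a single character.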

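Tab. 3 Numbers of labeled entities

Entity type	Labeled count	Entity type	Labeled count
Nationality (CONT)	1,361	Organization name (ORG)	5,810
Educational background (EDU)	510	Major (PRO)	687
Place name (LOC)	547	Ethnicity (RACE)	469
Person name (NAME)	1,258	Job title (TITLE)	7,908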

Fig. 1 Flow of E-MLM data augmentation
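
The abstract describes E-MLM only at a high level (entity-based masking followed by generation of new samples). The sketch below is one plausible reading of that idea, not the authors' implementation: every token inside a labeled entity is replaced by a mask token, and a pluggable `fill_fn` stands in for a masked language model that would predict replacements. The names `mask_entities`, `augment`, `fill_fn`, and `toy_fill` are hypothetical, and the toy filler is a fixed stub used only so that the example runs without a model.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"


def mask_entities(tokens: List[str], tags: List[str]) -> List[str]:
    """Replace every token that belongs to an entity (tag != "O") with [MASK]."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [MASK if tag != "O" else tok for tok, tag in zip(tokens, tags)]


def augment(
    tokens: List[str],
    tags: List[str],
    fill_fn: Callable[[List[str]], List[str]],
) -> Tuple[List[str], List[str]]:
    """Mask the entity tokens, let `fill_fn` fill them in, and reuse the tags."""
    filled = fill_fn(mask_entities(tokens, tags))
    if len(filled) != len(tokens):
        raise ValueError("fill_fn must preserve the sequence length")
    # The labels are unchanged because only entity contents were replaced.
    return filled, list(tags)


def toy_fill(masked: List[str]) -> List[str]:
    """Stub standing in for a masked language model (assumption)."""
    replacements = iter(["Wang", "Fang", "lecturer"])
    return [next(replacements) if tok == MASK else tok for tok in masked]


tokens = ["Li", "Ming", "is", "a", "professor"]
tags = ["B-NAME", "E-NAME", "O", "O", "S-TITLE"]
new_tokens, new_tags = augment(tokens, tags, toy_fill)
print(new_tokens)
print(new_tags)
```

Because the context tokens and the label sequence are kept, each generated sentence remains a valid training sample with the original annotation.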

Fig. 2 RoBERTa-ResBiLSTM-CRF model structure
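
The CRF layer in this architecture performs global decoding: it picks the label sequence with the highest total score rather than the best label at each position. A minimal sketch of Viterbi decoding, the standard algorithm for this step, is given below. The label set, emission scores, and transition scores are hand-set assumptions for illustration; in a trained model they would come from the encoder and the learned CRF parameters.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the label path that maximizes emission plus transition scores."""
    if not emissions:
        return []
    # best[y] = score of the best path that ends in label y at the current step
    best = {y: emissions[0][y] for y in labels}
    back: List[Dict[str, str]] = []
    for step in emissions[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            scores[y] = best[prev] + transitions.get((prev, y), 0.0) + step[y]
            pointers[y] = prev
        best = scores
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-NAME", "E-NAME"]
FORBIDDEN = -1e4  # large penalty for transitions that BIOES does not allow
transitions = {
    ("O", "E-NAME"): FORBIDDEN,
    ("B-NAME", "O"): FORBIDDEN,
    ("B-NAME", "B-NAME"): FORBIDDEN,
    ("E-NAME", "E-NAME"): FORBIDDEN,
}
emissions = [
    {"O": 0.1, "B-NAME": 2.0, "E-NAME": 0.0},
    {"O": 1.0, "B-NAME": 0.0, "E-NAME": 0.9},  # "O" wins locally here
    {"O": 2.0, "B-NAME": 0.0, "E-NAME": 0.0},
]
greedy = [max(labels, key=lambda y: step[y]) for step in emissions]
decoded = viterbi(emissions, transitions, labels)
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this example the per-position choice leaves a `B-NAME` without a closing `E-NAME`, while the globally decoded path respects the labeling rules.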

Fig. 3 Schematic diagram of RoBERTa-WWM model masks

Fig. 4 RoBERTa-WWM model structure

Fig. 5 Structure of LSTM model

Tab. 4 Hyperparameter values

Hyperparameter	Value
Dropout	0.5
Epoch	30
Batch_size	64
LSTM hidden layer dimension	768
Maximum sequence length	512
Learning rate	$2 \times 10^{-5}$

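Tab. 5 Effects of data augmentation under different methods

Number of resumes in dataset	Augmentation scheme	P/%	R/%	F1 score/%
100	Original data	72.77	67.66	70.26
	E-MLM	75.21	70.03	72.54
	Same-type entity replacement	73.37	68.87	71.06
	Entity context replacement	74.85	67.60	71.02
	Entity context deletion	72.41	68.14	70.20
200	Original data	79.47	74.76	77.06
	E-MLM	82.83	77.21	80.16
	Same-type entity replacement	80.69	75.90	78.31
	Entity context replacement	81.33	74.56	77.87
	Entity context deletion	79.68	75.09	77.36
400	Original data	85.53	80.82	82.99
	E-MLM	88.92	84.45	86.62
	Same-type entity replacement	86.71	81.98	84.48
	Entity context replacement	87.21	80.66	83.86
	Entity context deletion	85.62	81.12	83.53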

Tab. 6 Comparison of effects of different NER methods

Model	P/%	R/%	F1 score/%
BiLSTM-CRF	89.06	91.53	90.28
BERT-BiLSTM-CRF	93.72	93.38	93.54
ALBERT-BiLSTM-CRF	94.32	94.27	93.91
LEBERT-BiLSTM-CRF	94.54	94.33	94.43
RoBERTa-WWM-BiLSTM-CRF	95.53	94.65	95.09
RoBERTa-ResBiLSTM-CRF	95.97	96.37	96.16
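
The P, R, and F1 columns are related by the usual harmonic-mean definition of F1. The sketch below shows that relationship, assuming all three values are computed from the same pooled entity counts; the averaging scheme is not stated on this page, so reported values need not match this formula exactly. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    return 2 * precision * recall / (precision + recall)


# BiLSTM-CRF row of Tab. 6: P = 89.06, R = 91.53, reported F1 = 90.28.
bilstm_crf_f1 = round(f1_score(89.06, 91.53), 2)
print(bilstm_crf_f1)
```

Because F1 is a harmonic mean, it always lies between P and R and sits closer to the smaller of the two.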

Tab. 7 Comparison of overall recognition effects of proposed model on different datasets

Dataset	Version	P/%	R/%	F1 score/%
SenResume	Original dataset	95.97	96.37	96.16
SenResume	1x augmented dataset	97.48	98.23	97.84
Weibo	Original dataset	72.75	73.37	73.06
Weibo	1x augmented dataset	80.42	80.11	80.26
MSRA	Original dataset	91.27	90.89	91.08
MSRA	1x augmented dataset	93.55	93.20	93.37
CLUENER2020	Original dataset	82.34	82.77	82.55
CLUENER2020	1x augmented dataset	85.73	86.02	85.87
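
To make the effect of augmentation in Tab. 7 easier to compare across datasets, the short sketch below computes the F1 gain, in percentage points, from the original to the 1x augmented version of each dataset. The F1 values are copied from the table; the variable names are illustrative.

```python
# (original F1, 1x augmented F1) in percent, copied from Tab. 7
f1_by_dataset = {
    "SenResume": (96.16, 97.84),
    "Weibo": (73.06, 80.26),
    "MSRA": (91.08, 93.37),
    "CLUENER2020": (82.55, 85.87),
}

# Gain in percentage points, rounded to the table's two-decimal precision.
gains = {
    name: round(augmented - original, 2)
    for name, (original, augmented) in f1_by_dataset.items()
}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print("largest gain:", largest)
```

The gain is largest on Weibo, the dataset with the lowest original F1 score, and smallest on SenResume, where the original score is already above 96%.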

Tab. 8 Cross-domain and same-domain training and test results for datasets

Experiment No.	Training set	Test set	P/%	R/%	F1 score/%
1	SenResume	SenResume	95.97	96.37	96.16
2	People's Daily	People's Daily	96.04	95.30	95.67
3	People's Daily	SenResume	82.52	83.20	82.96
4	SenResume	People's Daily	92.15	90.75	91.44
5	1x augmented People's Daily	SenResume	85.25	85.87	84.02
6	1x augmented SenResume	People's Daily	94.15	93.75	91.44
