Journal of Computer Applications, 2025, Vol. 45, Issue 9: 2790-2797. DOI: 10.11772/j.issn.1001-9081.2024081143

• Artificial intelligence

Named entity recognition for sensitive information based on data augmentation and residual networks

Li LI, Han SONG, Peihe LIU, Hanlin CHEN

  1. Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
  • Received: 2024-08-14 Revised: 2024-10-16 Accepted: 2024-10-22 Online: 2024-11-07 Published: 2025-09-10
  • Contact: Li LI
  • About the authors: SONG Han, born in 2000, M.S. candidate. His research interests include artificial intelligence and natural language processing.
    LIU Peihe, born in 1972, engineer. His research interests include network and communication security and blockchain security.
    CHEN Hanlin, born in 1976, M.S., associate professor. His research interests include information security and system integration.
  • Supported by:
    Fundamental Research Funds for the Central Universities (3282023017); Project for Research and Practice on Innovative Talent Training Modes of Multidisciplinary Electronic Information Engineering (jy202202)

基于数据增强和残差网络的敏感信息命名实体识别

李莉, 宋涵, 刘培鹤, 陈汉林

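  1. Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute (北京电子科技学院), Beijing 100070
  • Corresponding author: Li LI (李莉)
  • About the authors: SONG Han (宋涵), born in 2000, male, from Heze, Shandong, M.S. candidate. Research interests: artificial intelligence, natural language processing.
    LIU Peihe (刘培鹤), born in 1972, male, from Hegang, Heilongjiang, engineer. Research interests: network and communication security, blockchain security.
    CHEN Hanlin (陈汉林), born in 1976, male, from Guangshui, Hubei, associate professor, M.S. Research interests: information security, system integration.
  • Funding:
    Fundamental Research Funds for the Central Universities (3282023017); Fundamental Research Funds for the Central Universities (3282024006); Fundamental Research Funds for the Central Universities (3282023054); Project for Research and Practice on Innovative Talent Training Modes of Multidisciplinary Electronic Information Engineering (jy202202)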

Abstract:

Named Entity Recognition (NER) for sensitive information is a key technology for privacy protection. However, existing NER methods struggle in the sensitive information domain because relevant datasets are scarce, and traditional techniques suffer from low accuracy and poor portability. To address these issues, firstly, a sensitive information NER dataset, SenResume, was constructed by crawling text corpora containing sensitive information from the Internet and annotating them manually. Secondly, a data augmentation model, Entity-based Masked Language Modeling (E-MLM), was proposed, which uses whole-word masking to generate new data samples, thereby expanding the dataset and increasing data diversity. Thirdly, a RoBERTa-ResBiLSTM-CRF model was proposed, which uses the Robustly optimized Bidirectional Encoder Representations from Transformers approach with Whole Word Masking (RoBERTa-WWM) to extract contextual features and generate high-quality word vector representations, and employs Residual Bidirectional Long Short-Term Memory (ResBiLSTM) to enhance text features. Finally, a multi-layer residual network was applied to improve training efficiency and model stability, and a Conditional Random Field (CRF) was used for global decoding to improve the accuracy of sequence labeling. Experimental results show that E-MLM significantly improves dataset quality, and that the proposed NER model achieves the best performance on both the original and the 1x augmented datasets, with F1 scores of 96.16% and 97.84%, respectively. The introduction of E-MLM and residual networks therefore contributes to improving the accuracy of sensitive information NER.
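
The entity-based masking idea behind E-MLM can be pictured with a minimal sketch. It assumes BIO-tagged tokens, a BERT-style `[MASK]` token, and hypothetical helper names (`entity_spans`, `mask_entities`); the abstract does not state the tagging scheme, the masking ratio, or how the masked language model fills the masked spans, so those details are assumptions here and the filling step is left out.

```python
import random

MASK = "[MASK]"  # assumed mask token, as in BERT-style vocabularies


def entity_spans(tags):
    """Return (start, end) pairs, end exclusive, for each BIO-tagged entity."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:      # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif tag.startswith("I-") and start is not None:
            continue                   # still inside the current entity
        else:                          # "O", or a stray "I-" with no open entity
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:              # entity that runs to the end of the sentence
        spans.append((start, len(tags)))
    return spans


def mask_entities(tokens, tags, mask_prob=0.5, rng=None):
    """Mask whole entity spans with probability mask_prob.

    The tag sequence is not modified, so labels stay aligned with whatever
    a masked language model later predicts for the masked positions.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    for start, end in entity_spans(tags):
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK
    return masked


tokens = ["Li", "Hua", "lives", "in", "Beijing"]
tags = ["B-NAME", "I-NAME", "O", "O", "B-LOC"]
print(mask_entities(tokens, tags, mask_prob=1.0))
```

Masking the full span rather than individual tokens keeps each entity intact as a unit, which is the property that whole-word masking is meant to provide; a real pipeline would then ask a masked language model to propose replacement entities for the masked positions.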

Key words: sensitive information, dataset construction, data augmentation, Bidirectional Encoder Representations from Transformers (BERT), Named Entity Recognition (NER)

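Abstract (translated from the Chinese version):

Named entity recognition (NER) for sensitive information is one of the key technologies for privacy protection. However, datasets for existing NER methods are scarce in the sensitive information domain, and traditional techniques suffer from low accuracy and poor portability. To solve these problems, first, text corpora containing sensitive information were crawled from the Internet and manually annotated to build SenResume, a sensitive information NER dataset. Second, an entity-masking-based data augmentation model, E-MLM (Entity-based Masked Language Modeling), was proposed, which generates new data samples through whole-word masking and expands the dataset to increase data diversity. Third, the RoBERTa-ResBiLSTM-CRF model was proposed, which uses RoBERTa-WWM (Robustly optimized Bidirectional Encoder Representations from Transformers approach with Whole Word Masking) to extract contextual features and generate high-quality word vector encodings, and uses residual bidirectional long short-term memory (ResBiLSTM) to enhance text features. Finally, a multi-layer residual network was used to improve training efficiency and model stability, and a conditional random field (CRF) was used for global decoding to improve the accuracy of sequence labeling. Experimental results show that E-MLM significantly improves dataset quality, and that the proposed NER model performs best on both the original dataset and the 1x augmented dataset, with F1 scores of 96.16% and 97.84%, respectively. The introduction of E-MLM and residual networks is thus beneficial to the accuracy of sensitive information NER.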
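
The "global decoding" role of the CRF layer can be illustrated with a small Viterbi sketch. This is a generic illustration rather than the authors' implementation: the three-tag set (0 = O, 1 = B, 2 = I), the scores, and the function name `viterbi_decode` are all assumptions made for the example.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_path, best_score) over all tag sequences.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])        # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            # best previous tag for a path that ends in tag j at this position
            best_prev = max(range(n_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # follow the backpointers from the best final tag
    best_last = max(range(n_tags), key=lambda j: scores[j])
    path = [best_last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, scores[best_last]


# Tags: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalized.
emissions = [[2.0, 1.0, 0.0],
             [0.0, 0.0, 3.0]]
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
path, score = viterbi_decode(emissions, transitions)
print(path, score)  # [1, 2] 4.0
```

Picking the highest emission at each position independently would give the sequence O, I, which the transition scores penalize; decoding over the whole sequence selects B, I instead, which is the sense in which a CRF decodes globally.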

Key words: sensitive information, dataset construction, data augmentation, BERT, named entity recognition
