Journal of Computer Applications, 2022, Vol. 42, Issue 7: 2009-2014. DOI: 10.11772/j.issn.1001-9081.2021050877

• Artificial intelligence •

Sensitive information detection method based on attention mechanism-based ELMo

Cheng HUANG, Qianrui ZHAO

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2021-05-27 Revised: 2021-08-27 Accepted: 2021-08-30 Online: 2022-01-06 Published: 2022-07-10
  • Contact: Qianrui ZHAO
  • About author: HUANG Cheng, born in 1987, Ph.D., associate professor. His research interests include network security, attack and defense technology.
  • Supported by:
    National Natural Science Foundation of China (61902265); Key Research and Development Program of Science and Technology Department of Sichuan Province (2020YFG0076)

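Sensitive information detection method based on language model word embedding and attention mechanism

HUANG Cheng (黄诚), ZHAO Qianrui (赵倩锐)

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065
  • Corresponding author: ZHAO Qianrui
  • About the author: HUANG Cheng (born 1987), male, from Yunyang, Chongqing, associate professor, Ph.D., CCF member. His main research interests are network security and attack and defense technology.
  • Funding:
    National Natural Science Foundation of China (61902265); Key Research and Development Program of the Science and Technology Department of Sichuan Province (2020YFG0076)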

Abstract:

To address the low accuracy and poor generalization of traditional sensitive information detection methods, such as those based on keyword character matching and on phrase-level sentiment analysis, a sensitive information detection method based on Attention mechanism-based Embedding from Language Model (A-ELMo) was proposed. Firstly, fast trie-based matching was performed to significantly reduce comparisons against irrelevant words, thereby greatly improving query efficiency. Secondly, an Embedding from Language Model (ELMo) was constructed for context analysis, and dynamic word vectors were used to fully represent contextual features, achieving high scalability. Finally, an attention mechanism was incorporated to strengthen the model's ability to identify sensitive features and further improve the detection rate of sensitive information. Experiments were carried out on real datasets composed of multiple network data sources. The results show that the accuracy of the proposed method is 13.3 percentage points higher than that of the phrase-level sentiment analysis-based method and 43.5 percentage points higher than that of the keyword matching-based method, verifying its advantages in strengthening the identification of sensitive features and improving the detection rate of sensitive information.

Key words: sensitive information, Embedding from Language Model (ELMo), context analysis, attention mechanism, trie tree
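
The trie matching step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so the class names `TrieNode` and `Trie`, the method `find_all`, and the sample keywords below are all hypothetical. The point it shows is that a scan abandons a candidate position as soon as no keyword shares the current prefix, which is what avoids comparing the text against every keyword in turn.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of the keyword trie (hypothetical helper, not from the paper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a keyword ends at this node


class Trie:
    """Prefix tree over a keyword list, used to scan text for keyword hits."""

    def __init__(self, words: List[str]) -> None:
        self.root = TrieNode()
        for word in words:
            if word:  # ignore empty keywords
                self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def find_all(self, text: str) -> List[Tuple[int, str]]:
        """Return (start_index, keyword) for every keyword occurrence in text."""
        hits: List[Tuple[int, str]] = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no keyword continues with this character
                if node.is_end:
                    hits.append((start, text[start:end + 1]))
        return hits


# Illustrative keywords only; the paper's lexicon is not given in the abstract.
trie = Trie(["pass", "password", "leak"])
matches = trie.find_all("the password leak")
```

In a pipeline like the one the abstract outlines, such hits would presumably serve as a cheap first pass before the context-aware model runs, but that ordering is an assumption here.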

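Abstract (translated from Chinese):

To address the low accuracy and poor generalization of traditional sensitive information detection methods such as keyword character matching and phrase-level sentiment analysis, a sensitive information detection method based on language model word embedding and an attention mechanism (A-ELMo) is proposed. First, fast trie matching is performed to minimize comparisons against useless characters, which greatly improves query efficiency. Second, an Embedding from Language Model (ELMo) model is built for context analysis, and dynamic word vectors are used to fully represent contextual features, achieving high scalability. Finally, an attention mechanism is incorporated to strengthen the model's recognition of sensitive features, further raising the detection rate of sensitive information. In experiments on a real dataset built from multiple network data sources, the proposed method improves accuracy by 13.3 percentage points over the phrase-level sentiment analysis-based method and by 43.5 percentage points over the keyword matching-based method, which fully verifies its superiority in strengthening sensitive feature recognition and raising the sensitive information detection rate.

Key words (translated from Chinese): sensitive information, language model word embedding, context analysis, attention mechanism, trie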
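
The attention step can be illustrated in the same hedged spirit. The abstract does not specify the attention formulation, so the sketch below assumes plain scaled dot-product attention pooling: each token vector (standing in for an ELMo output) is scored against a query vector, the scores are normalized with a softmax, and the weighted sum gives one sentence-level feature vector. The function names `softmax` and `attention_pool`, the query vector, and the toy 2-dimensional inputs are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    vectors: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Pool token vectors into one vector using scaled dot-product attention.

    Assumption: `query` plays the role of a learned "sensitivity" vector;
    in a trained model it would be a parameter, here it is supplied by hand.
    """
    if not vectors:
        raise ValueError("need at least one token vector")
    dim = len(query)
    scale = math.sqrt(dim)
    # Score each token by its scaled dot product with the query.
    scores = [sum(v * q for v, q in zip(vec, query)) / scale for vec in vectors]
    weights = softmax(scores)
    # Weighted sum of token vectors, one output value per dimension.
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


# Toy inputs: three "token vectors" and a query that favors the first dimension.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(tokens, [2.0, 0.0])
```

Tokens whose vectors align with the query receive larger weights, which is one way a model can emphasize sensitive features over neutral context.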
