Sensitive information detection method based on attention mechanism-based ELMo

doi:10.11772/j.issn.1001-9081.2021050877

Abstract

To address the low accuracy and poor generalization of traditional sensitive information detection methods, such as those based on keyword character matching and on phrase-level sentiment analysis, a sensitive information detection method based on Attention mechanism-based Embedding from Language Model (A-ELMo) was proposed. Firstly, fast trie-tree matching was performed to significantly reduce comparisons against useless words, thereby greatly improving query efficiency. Secondly, an Embedding from Language Model (ELMo) was constructed for context analysis, and dynamic word vectors were used to fully represent contextual features, achieving high scalability. Finally, the attention mechanism was incorporated to strengthen the model's ability to identify sensitive features and further improve the detection rate of sensitive information. Experiments were carried out on real datasets composed of multiple network data sources. The results show that the accuracy of the proposed method is 13.3 percentage points higher than that of the phrase-level sentiment analysis-based method and 43.5 percentage points higher than that of the keyword matching-based method, verifying that the proposed method strengthens the identification of sensitive features and improves the detection rate of sensitive information.
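
The trie-based matching step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, the `find_all` method, and the sample word list are all hypothetical, chosen only to show why a trie avoids comparing the text against every dictionary word separately.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the next TrieNode
        self.is_end = False  # True if a dictionary word ends at this node


class Trie:
    """Hypothetical word-list trie; the names here are illustrative only."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def find_all(self, text):
        """Return (start_index, matched_word) for every dictionary word in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no dictionary word continues with this character
                if node.is_end:
                    hits.append((start, text[start:end + 1]))
        return hits


# Illustrative word list, not taken from the paper.
trie = Trie(["leak", "leaked", "password"])
hits = trie.find_all("the leaked password list")
print(hits)  # [(4, 'leak'), (4, 'leaked'), (11, 'password')]
```

Each scan from a given start position stops at the first character that no dictionary word shares, so most positions are rejected after one or two lookups rather than after a comparison with every word.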

Key words: sensitive information, Embedding from Language Model (ELMo), context analysis, attention mechanism, trie tree
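
The attention step named in the abstract can be sketched in the same spirit. The abstract does not give the exact attention formulation, so the code below assumes plain dot-product attention over per-token vectors with a hypothetical query vector, and the toy 2-dimensional vectors stand in for ELMo's contextual embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Weight each token vector by its dot product with `query`, then average.

    Assumption: dot-product scoring; the paper may use a different scoring function.
    """
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return weights, pooled


# Toy stand-ins for contextual token embeddings (hypothetical values).
token_vectors = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
weights, pooled = attention_pool(token_vectors, query)
print(weights, pooled)
```

The token whose vector aligns best with the query receives the largest weight, which is the sense in which attention lets a classifier emphasize sensitive features over neutral context.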

CLC Number:

TP183

Cheng HUANG, Qianrui ZHAO. Sensitive information detection method based on attention mechanism-based ELMo[J]. Journal of Computer Applications, 2022, 42(7): 2009-2014.

References

1	QIAO H, TIAN Z, LI W L, et al. A sensitive information detection method based on network traffic restore[C]// Proceedings of the 12th International Conference on Measuring Technology and Mechatronics Automation. Piscataway: IEEE, 2020: 832-836. doi:10.1109/icmtma50254.2020.00181
2	XU Y Y, LI Y X, ZHANG Z Y. Sensitive text classification and detection method based on sentiment analysis[J]. International Core Journal of Engineering, 2021, 7(5): 60-66.
3	DIAS M, BONÉ J, FERREIRA J C, et al. Named entity recognition for sensitive data discovery in Portuguese[J]. Applied Sciences, 2020, 10(7): No.2303. doi:10.3390/app10072303
4	ESIN Y E, ALAN O, ALPASLAN F N. Improvement on corpus-based word similarity using vector space models[C]// Proceedings of the 24th International Symposium on Computer and Information Sciences. Piscataway: IEEE, 2009: 280-285. doi:10.1109/iscis.2009.5291827
5	SUNDERMEYER M, SCHLÜTER R, NEY H. LSTM neural networks for language modeling[C]// Proceedings of the Interspeech 2012. [S.l.]: International Speech Communication Association, 2012: 194-197.
6	LIU W Y, WEN Y D, YU Z D, et al. Large-margin softmax loss for convolutional neural networks[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 507-516.
7	GUTHRIE D, ALLISON B, LIU W, et al. A closer look at skip-gram modelling[C]// Proceedings of the 5th International Conference on Language Resources and Evaluation. [S.l.]: European Language Resources Association, 2006: 1222-1225.
8	DENG Y G, WU Y Y. Information filtering algorithm of test content-based sensitive words decision tree[J]. Computer Engineering, 2014, 40(9): 300-304. doi:10.3969/j.issn.1000-3428.2014.09.060
9	FU C, YU D H, ZHANG L L. Study on identification method for change from of Chinese sensitive words[J]. Application Research of Computers, 2019, 36(4): 988-991. doi:10.19734/j.issn.1001-3695.2017.11.0996
10	LI Y, PAN Q, YANG T. Sensitive information recognition based on short text sentiment analysis[J]. Journal of Xi'an Jiaotong University, 2016, 50(9): 80-84. doi:10.7652/xjtuxb201609013
11	YAO Y Q, ZHENG Y W, LYU Y X. Emotional text classification method based on LS-SO algorithm[J]. Journal of Jilin University (Science Edition), 2019, 57(2): 375-379. doi:10.13413/j.cnki.jdxblxb.2018241
12	HU S C, SUN J P, JU S G, et al. Chinese emotion feature selection method based on the extended emotion dictionary and the chi-square model[J]. Journal of Sichuan University (Natural Science Edition), 2019, 56(1): 37-44.
13	MING Y Y, LIU X J. Sensitive information detection based on phrase-level sentiment analysis[J]. Journal of Sichuan University (Natural Science Edition), 2019, 56(6): 1042-1048.
14	GUO Y Y, LIU J Y, TANG W W, et al. ExSense: extract sensitive information from unstructured data[J]. Computers and Security, 2021, 102: No.102156. doi:10.1016/j.cose.2020.102156
15	WANG Y J, SHEN X J, YANG Y J. The classification of Chinese sensitive information based on BERT-CNN[C]// Proceedings of the 2019 International Symposium on Intelligence Computation and Applications, CCIS 1205. Singapore: Springer, 2020: 269-280.
16	XUE P Q, NURBOL, ISLAM W. Sensitive information filtering algorithm based on text information network[J]. Computer Engineering and Design, 2016, 37(9): 2447-2452.
17	FU Y, YU Y, WU X P. A sensitive word detection method based on variants recognition[C]// Proceedings of the 2019 International Conference on Machine Learning, Big Data and Business Intelligence. Piscataway: IEEE, 2019: 47-52. doi:10.1109/mlbdbi48998.2019.00017
18	DING M, WANG X, WU C M, et al. Research on automated detection of sensitive information based on BERT[J]. Journal of Physics: Conference Series, 2021, 1757: No.012088. doi:10.1088/1742-6596/1757/1/012088
19	BIGONHA M A S, FERREIRA K, SOUZA P, et al. The usefulness of software metric thresholds for detection of bad smells and fault prediction[J]. Information and Software Technology, 2019, 115: 79-92. doi:10.1016/j.infsof.2019.08.005
20	LI D Y, ZHAO Y H, LUO M J, et al. Query text proofreading method of professional courses based on trie tree language model[J]. Journal of Yanbian University (Natural Science), 2020, 46(3): 260-264.
21	LOPEZ M M, KALITA J. Deep learning applied to NLP[EB/OL]. (2017-03-09)[2021-03-13].
22	ZHOU F Y, JIN L P, DONG J. Review of convolutional neural network[J]. Chinese Journal of Computers, 2017, 40(6): 1229-1251. doi:10.11897/SP.J.1016.2017.01229
23	PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543. doi:10.3115/v1/d14-1162
24	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[EB/OL]. (2013-09-07)[2021-03-13]. doi:10.3126/jiee.v3i1.34327
25	JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2017: 427-431. doi:10.18653/v1/e17-2068
26	SHARMIN S, CHAKMA D. Attention-based convolutional neural network for Bangla sentiment analysis[J]. AI and Society, 2021, 36(1): 381-396. doi:10.1007/s00146-020-01011-0
27	LIU Y, YANG C Y, YANG J. A graph convolutional network-based sensitive information detection algorithm[J]. Complexity, 2021, 2021: No.6631768. doi:10.1155/2021/6631768
28	BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.

Data type	Total samples	Training samples	Test samples
People's Daily Online	31 560	22 092	9 468
Xinhuanet	33 770	23 639	10 131
CCTV News	12 430	8 701	3 729
Sensitive blog articles	52 470	36 729	15 741
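
The counts in this table are consistent with a 70/30 train/test split for every source. A quick arithmetic check, with the figures copied from the table and the source names rendered in English for illustration:

```python
# (total, training, test) sample counts copied from the dataset table.
counts = {
    "People's Daily Online": (31560, 22092, 9468),
    "Xinhuanet": (33770, 23639, 10131),
    "CCTV News": (12430, 8701, 3729),
    "Sensitive blog articles": (52470, 36729, 15741),
}

train_share = {}
for source, (total, train, test) in counts.items():
    # Training and test counts should add up to the reported total.
    assert train + test == total, source
    train_share[source] = round(train / total, 2)

print(train_share)  # every source has a training share of 0.7
```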

Method	Accuracy/%	Recall/%	Precision/%
Proposed method	84.2	93.8	78.7
Phrase-level sentiment analysis [13]	73.2	89.4	62.5
Keyword matching	24.1	82.6	38.7

Method	Accuracy/%	Recall/%	Precision/%
Proposed method	85.0	89.3	84.5
Phrase-level sentiment analysis [13]	71.7	82.4	70.8
Keyword matching	41.5	68.0	56.2
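
The accuracy column of this last table reproduces the improvements quoted in the abstract. The short check below uses the three accuracy values from the table; the dictionary keys are illustrative labels, not names from the paper.

```python
# Accuracy values (%) copied from the results table above.
accuracy = {
    "proposed": 85.0,
    "phrase_level_sentiment": 71.7,
    "keyword_matching": 41.5,
}


def gain_in_points(baseline):
    """Absolute accuracy gain of the proposed method, in percentage points."""
    return round(accuracy["proposed"] - accuracy[baseline], 1)


gains = {name: gain_in_points(name) for name in accuracy if name != "proposed"}
print(gains)  # {'phrase_level_sentiment': 13.3, 'keyword_matching': 43.5}
```

These are absolute differences in percentage points, not relative improvements: relative to the keyword-matching accuracy of 41.5%, the same 43.5-point gain would be a little more than a doubling.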