Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (7): 2009-2014. DOI: 10.11772/j.issn.1001-9081.2021050877
Special Issue: Artificial intelligence
Received: 2021-05-27
Revised: 2021-08-27
Accepted: 2021-08-30
Online: 2022-01-06
Published: 2022-07-10
Contact: Qianrui ZHAO
About author: HUANG Cheng, born in 1987 in Yunyang, Chongqing, Ph. D., associate professor, CCF member. His research interests include network security and attack and defense technology.
Cheng HUANG, Qianrui ZHAO. Sensitive information detection method based on attention mechanism-based ELMo[J]. Journal of Computer Applications, 2022, 42(7): 2009-2014.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021050877
Data type | Total samples | Training samples | Test samples |
---|---|---|---|
People's Daily Online | 31 560 | 22 092 | 9 468 |
Xinhuanet | 33 770 | 23 639 | 10 131 |
CCTV News | 12 430 | 8 701 | 3 729 |
Sensitive blog articles | 52 470 | 36 729 | 15 741 |
Tab. 1 Experimental datasets
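The counts in Tab. 1 imply a fixed train/test split for every source. The following is a minimal sketch that checks this arithmetic; the English row labels and the helper name `split_summary` are illustrative assumptions, not taken from the paper.

```python
# Dataset sizes copied from Tab. 1: (total, training, test).
# Row labels are English renderings chosen for this sketch.
datasets = {
    "People's Daily Online": (31560, 22092, 9468),
    "Xinhuanet": (33770, 23639, 10131),
    "CCTV News": (12430, 8701, 3729),
    "Sensitive blog articles": (52470, 36729, 15741),
}


def split_summary(total, train, test):
    """Return (whether train + test equals total, training fraction rounded to 3 places)."""
    return train + test == total, round(train / total, 3)


summary = {name: split_summary(*sizes) for name, sizes in datasets.items()}

for name, (consistent, train_fraction) in summary.items():
    print(f"{name}: consistent={consistent}, train fraction={train_fraction}")
```

Under these numbers, each source appears to be divided in the same 7:3 proportion between training and test data.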
Method | Accuracy/% | Recall/% | Precision/% |
---|---|---|---|
Proposed method | 84.2 | 93.8 | 78.7 |
Phrase-level sentiment analysis[ | 73.2 | 89.4 | 62.5 |
Keyword matching | 24.1 | 82.6 | 38.7 |
Tab. 2 Performance comparison of three methods on three indicators
Method | Accuracy/% | Recall/% | Precision/% |
---|---|---|---|
Proposed method | 85.0 | 89.3 | 84.5 |
Phrase-level sentiment analysis[ | 71.7 | 82.4 | 70.8 |
Keyword matching | 41.5 | 68.0 | 56.2 |
Tab. 3 Performance comparison of three methods on three indicators after data randomization
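Tab. 2 and Tab. 3 report recall and precision separately. As a rough way to compare the methods with a single number, the sketch below derives the F1 score (the harmonic mean of precision and recall) from the tabulated values. F1 is not reported in the tables, and the function name `f1_score` and the English method labels are assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) in percent, copied from Tab. 2 and Tab. 3.
results = {
    "Tab. 2": {
        "Proposed method": (93.8, 78.7),
        "Phrase-level sentiment analysis": (89.4, 62.5),
        "Keyword matching": (82.6, 38.7),
    },
    "Tab. 3": {
        "Proposed method": (89.3, 84.5),
        "Phrase-level sentiment analysis": (82.4, 70.8),
        "Keyword matching": (68.0, 56.2),
    },
}

# Derived F1 per table and method; these are computed values, not reported ones.
f1_by_table = {
    table: {method: f1_score(p, r) for method, (r, p) in rows.items()}
    for table, rows in results.items()
}

for table, rows in f1_by_table.items():
    for method, f1 in rows.items():
        print(f"{table} | {method}: F1 = {f1:.1f}%")
```

Because F1 is symmetric in its two arguments, the ranking it produces depends only on how balanced each method's recall and precision are, not on which of the two is higher.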
Method | Accuracy/% | Recall/% | Precision/% |
---|---|---|---|
Proposed method | 85.0 | 93.8 | 78.7 |
FastText[ | 75.9 | 91.4 | 65.1 |
Word2Vec[ | 67.3 | 84.9 | 59.0 |
GloVe[ | 62.6 | 85.1 | 53.4 |
Ref. [ | 82.1 | 95.2 | 72.0 |
Ref. [ | 82.9 | 94.7 | 73.4 |
No word embedding | 52.1 | 67.5 | 48.0 |
Tab. 4 Performance comparison of word embedding models used in seven methods on three indicators
Method | Training epochs | Training time/min | Test time/s | Accuracy/% |
---|---|---|---|---|
Bi-LSTM | 12 | 45 | 108 | 85.0 |
1D-CNN | 70 | 19 | 30 | 64.5 |
Hierarchical softmax[ | 55 | 13 | 22 | 72.1 |
Transformer[ | 36 | 31 | 74 | 69.2 |
Tab. 5 Performance comparison of four language models on four indicators
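Tab. 5 lists epoch counts and training times side by side, which makes the cost per epoch hard to read off directly. The sketch below divides one by the other. It assumes that the training-time column is the total time over all listed epochs, which the table itself does not state; the helper name `minutes_per_epoch` is likewise illustrative.

```python
# (training epochs, training time in minutes, accuracy in %) copied from Tab. 5.
models = {
    "Bi-LSTM": (12, 45, 85.0),
    "1D-CNN": (70, 19, 64.5),
    "Hierarchical softmax": (55, 13, 72.1),
    "Transformer": (36, 31, 69.2),
}


def minutes_per_epoch(epochs, total_minutes):
    """Average training time per epoch, assuming total_minutes covers all epochs."""
    return total_minutes / epochs


per_epoch = {
    name: minutes_per_epoch(epochs, minutes)
    for name, (epochs, minutes, _accuracy) in models.items()
}

for name, value in per_epoch.items():
    accuracy = models[name][2]
    print(f"{name}: {value:.2f} min/epoch, accuracy {accuracy}%")
```

Read this way, Bi-LSTM needs the fewest epochs but the most time per epoch, which is consistent with it having the longest total training time and the highest accuracy in the table.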
1 | QIAO H, TIAN Z, LI W L, et al. A sensitive information detection method based on network traffic restore[C]// Proceedings of the 12th International Conference on Measuring Technology and Mechatronics Automation. Piscataway: IEEE, 2020: 832-836. 10.1109/icmtma50254.2020.00181 |
2 | XU Y Y, LI Y X, ZHANG Z Y. Sensitive text classification and detection method based on sentiment analysis[J]. International Core Journal of Engineering, 2021, 7(5): 60-66. |
3 | DIAS M, BONÉ J, FERREIRA J C, et al. Named entity recognition for sensitive data discovery in Portuguese[J]. Applied Sciences, 2020, 10(7): No.2303. 10.3390/app10072303 |
4 | ESIN Y E, ALAN O, ALPASLAN F N. Improvement on corpus-based word similarity using vector space models[C]// Proceedings of the 24th International Symposium on Computer and Information Sciences. Piscataway: IEEE, 2009: 280-285. 10.1109/iscis.2009.5291827 |
5 | SUNDERMEYER M, SCHLÜTER R, NEY H. LSTM neural networks for language modeling[C]// Proceedings of the Interspeech 2012. [S.l.]: International Speech Communication Association, 2012: 194-197. |
6 | LIU W Y, WEN Y D, YU Z D, et al. Large-margin softmax loss for convolutional neural networks[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 507-516. |
7 | GUTHRIE D, ALLISON B, LIU W, et al. A closer look at skip-gram modelling[C]// Proceedings of the 5th International Conference on Language Resources and Evaluation. [S.l.]: European Language Resources Association, 2006: 1222-1225. |
8 | DENG Y G, WU Y Y. Information filtering algorithm of text content-based sensitive words decision tree[J]. Computer Engineering, 2014, 40(9): 300-304. 10.3969/j.issn.1000-3428.2014.09.060 |
9 | FU C, YU D H, ZHANG L L. Study on identification method for change form of Chinese sensitive words[J]. Application Research of Computers, 2019, 36(4): 988-991. 10.19734/j.issn.1001-3695.2017.11.0996 |
10 | LI Y, PAN Q, YANG T. Sensitive information recognition based on short text sentiment analysis[J]. Journal of Xi’an Jiaotong University, 2016, 50(9): 80-84. 10.7652/xjtuxb201609013 |
11 | YAO Y Q, ZHENG Y W, LYU Y X. Emotional text classification method based on LS-SO algorithm[J]. Journal of Jilin University (Science Edition), 2019, 57(2): 375-379. 10.13413/j.cnki.jdxblxb.2018241 |
12 | HU S C, SUN J P, JU S G, et al. Chinese emotion feature selection method based on the extended emotion dictionary and the chi-square model[J]. Journal of Sichuan University (Natural Science Edition), 2019, 56(1): 37-44. |
13 | MING Y Y, LIU X J. Sensitive information detection based on phrase-level sentiment analysis[J]. Journal of Sichuan University (Natural Science Edition), 2019, 56(6): 1042-1048. |
14 | GUO Y Y, LIU J Y, TANG W W, et al. ExSense: extract sensitive information from unstructured data[J]. Computers and Security, 2021, 102: No.102156. 10.1016/j.cose.2020.102156 |
15 | WANG Y J, SHEN X J, YANG Y J. The classification of Chinese sensitive information based on BERT-CNN[C]// Proceedings of the 2019 International Symposium on Intelligence Computation and Applications, CCIS 1205. Singapore: Springer, 2020: 269-280. |
16 | XUE P Q, NURBOL, ISLAM W. Sensitive information filtering algorithm based on text information network[J]. Computer Engineering and Design, 2016, 37(9): 2447-2452. |
17 | FU Y, YU Y, WU X P. A sensitive word detection method based on variants recognition[C]// Proceedings of the 2019 International Conference on Machine Learning, Big Data and Business Intelligence. Piscataway: IEEE, 2019: 47-52. 10.1109/mlbdbi48998.2019.00017 |
18 | DING M, WANG X, WU C M, et al. Research on automated detection of sensitive information based on BERT[J]. Journal of Physics: Conference Series, 2021, 1757: No.012088. 10.1088/1742-6596/1757/1/012088 |
19 | BIGONHA M A S, FERREIRA K, SOUZA P, et al. The usefulness of software metric thresholds for detection of bad smells and fault prediction[J]. Information and Software Technology, 2019, 115: 79-92. 10.1016/j.infsof.2019.08.005 |
20 | LI D Y, ZHAO Y H, LUO M J, et al. Query text proofreading method of professional courses based on trie tree language model[J]. Journal of Yanbian University (Natural Science), 2020, 46(3): 260-264. |
21 | LOPEZ M M, KALITA J. Deep learning applied to NLP[EB/OL]. (2017-03-09) [2021-03-13].. |
22 | ZHOU F Y, JIN L P, DONG J. Review of convolutional neural network[J]. Chinese Journal of Computers, 2017, 40(6): 1229-1251. 10.11897/SP.J.1016.2017.01229 |
23 | PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543. 10.3115/v1/d14-1162 |
24 | MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[EB/OL]. (2013-09-07) [2021-03-13].. 10.3126/jiee.v3i1.34327 |
25 | JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2017: 427-431. 10.18653/v1/e17-2068 |
26 | SHARMIN S, CHAKMA D. Attention-based convolutional neural network for Bangla sentiment analysis[J]. AI and Society, 2021, 36(1): 381-396. 10.1007/s00146-020-01011-0 |
27 | LIU Y, YANG C Y, YANG J. A graph convolutional network-based sensitive information detection algorithm[J]. Complexity, 2021, 2021: No.6631768. 10.1155/2021/6631768 |
28 | BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155. |