Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (5): 1445-1453. DOI: 10.11772/j.issn.1001-9081.2022040551
Special Issue: Artificial Intelligence
• Artificial intelligence •
Shunhang CHENG, Zhihua LI, Tao WEI
Received: 2022-04-25
Revised: 2022-06-30
Accepted: 2022-07-01
Online: 2022-08-05
Published: 2023-05-10
Contact: Zhihua LI
About author: CHENG Shunhang, born in 1998 in Jingmen, Hubei, M. S. candidate. His research interests include natural language processing and information security.
Shunhang CHENG, Zhihua LI, Tao WEI. Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling[J]. Journal of Computer Applications, 2023, 43(5): 1445-1453.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022040551
Specialized term | Regular expression |
---|---|
URL | (s?[hf]t?tps?%3A%2F%2F\w[\w%-]*?)(?:[^\w%-]|$) |
IP | (?:^|(?![^\d\.]))(?:(?:[ |
MD5 | ([a-f0-9]{32}|[A-F0-9]{32}) |
SHA1 | ([a-f0-9]{40}|[A-F0-9]{40}) |
SHA256 | ([a-f0-9]{64}|[A-F0-9]{64}) |
Tab. 1 Regular expressions of specialized vocabulary
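
The hash rows of Tab. 1 can be tried out directly. The following is a minimal sketch, not the authors' implementation: the three patterns are copied from the table, while the `\b` word-boundary anchors and the helper name `find_hash_iocs` are assumptions added here so that a 64-character digest is not also reported as an MD5 or SHA1 prefix.

```python
import re

# Hash patterns copied from Tab. 1. The \b anchors are an added assumption:
# without them, the first 32 characters of a SHA256 digest would also match MD5.
HASH_PATTERNS = {
    "MD5": re.compile(r"\b([a-f0-9]{32}|[A-F0-9]{32})\b"),
    "SHA1": re.compile(r"\b([a-f0-9]{40}|[A-F0-9]{40})\b"),
    "SHA256": re.compile(r"\b([a-f0-9]{64}|[A-F0-9]{64})\b"),
}


def find_hash_iocs(text):
    """Return (label, value) pairs for every hash-like token in text."""
    hits = []
    for label, pattern in HASH_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(1)))
    return hits
```

Because each pattern accepts either all-lowercase or all-uppercase hex, a mixed-case digest is not matched; whether that is intended is not stated in the source.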
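Entity type | Pattern matching method | Entity type | Pattern matching method |
---|---|---|---|
Attacker | Dictionary + regular expression | Location | Dictionary |
Malware | Dictionary + regular expression | Type | Dictionary |
Cve | Regular expression | IoC | Regular expression |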
Tab. 2 Entity types and pattern matching methods
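
One way to picture the dictionary and regular-expression matching of Tab. 2 is the sketch below. It is an illustration under stated assumptions rather than the paper's code: the seed dictionaries reuse the English glosses of the examples in Tab. 4, the CVE identifier pattern is not given in the source and is assumed here, and for brevity Attacker and Malware are matched by dictionary only.

```python
import re

# Hypothetical seed dictionaries; entries are glosses of the examples in Tab. 4.
DICTIONARIES = {
    "Attacker": {"APT34", "OceanLotus"},
    "Malware": {"Mirai"},
    "Location": {"China", "Southeast Asia"},
    "Type": {"virus", "Trojan"},
}

# Assumed pattern for CVE identifiers (not given in the source).
REGEXES = {
    "Cve": re.compile(r"CVE-\d{4}-\d{4,}"),
}


def match_entities(text):
    """Return sorted (start, end, label, surface) tuples found in text."""
    found = []
    # Dictionary lookup: every literal occurrence of a known term.
    for label, terms in DICTIONARIES.items():
        for term in terms:
            for m in re.finditer(re.escape(term), text):
                found.append((m.start(), m.end(), label, m.group()))
    # Regular-expression lookup for pattern-shaped entities.
    for label, pattern in REGEXES.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)
```

Sorting by character offset keeps the matches in reading order, which is convenient when the output is later paired into candidate relations.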
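Agent | Patient | Relation set |
---|---|---|
Attacker | Malware | uses |
Malware | Type | belongs to |
Malware | Malware | related to, downloads, controls |
Attacker, Malware | Location | located in, targets |
Attacker, Malware | Cve | exploits |
Attacker, Malware | IoC | points to |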
Tab. 3 Relation category
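
The relation categories of Tab. 3 amount to a type constraint on (agent, patient) pairs. A minimal sketch of that constraint as a lookup table follows; the English relation labels are glosses of the Chinese originals, the expansion of the merged agent cells is an assumption cross-checked against the pairs listed in Tab. 5, and `is_valid_triple` is a hypothetical helper name.

```python
# Allowed relations per (agent type, patient type), read from Tab. 3.
# Expanding the merged cells this way is an assumption checked against Tab. 5.
RELATION_SCHEMA = {
    ("Attacker", "Malware"): {"uses"},
    ("Malware", "Type"): {"belongs to"},
    ("Malware", "Malware"): {"related to", "downloads", "controls"},
    ("Attacker", "Location"): {"located in", "targets"},
    ("Malware", "Location"): {"located in", "targets"},
    ("Attacker", "Cve"): {"exploits"},
    ("Malware", "Cve"): {"exploits"},
    ("Attacker", "IoC"): {"points to"},
    ("Malware", "IoC"): {"points to"},
}


def is_valid_triple(agent_type, relation, patient_type):
    """Check whether a relation is permitted between two entity types."""
    return relation in RELATION_SCHEMA.get((agent_type, patient_type), set())
```

A check like this could be used to discard candidate triples whose entity types cannot hold the proposed relation, for example a `uses` relation whose agent is a Location.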
Entity label type | Examples | Count |
---|---|---|
Attacker | APT34, OceanLotus | 598 |
Malware | Mirai, Moze | 1 057 |
Cve | EternalBlue | 282 |
Location | China, Southeast Asia | 582 |
Type | virus, Trojan | 417 |
IoC | 77.245.76[.]66 | 1 638 |
Tab. 4 Number of labels per entity
Relation type | Example | Count |
---|---|---|
(Attacker, Malware) | (OceanLotus, uses, Mirai) | 96 |
(Attacker, Location) | (OceanLotus, targets, China) | 212 |
(Malware, Location) | (Mirai, targets, China) | 58 |
(Attacker, Cve) | (OceanLotus, exploits, EternalBlue) | 24 |
(Malware, Cve) | (Mirai, exploits, EternalBlue) | 12 |
(Malware, Type) | (Mirai, belongs to, Trojan) | 174 |
(Malware, Malware) | (Mirai, related to, Moze) | 541 |
(Attacker, IoC) | (OceanLotus, points to, 77.245.76[.]66) | 812 |
(Malware, IoC) | (Mirai, points to, 77.245.76[.]66) | 1 031 |
Tab. 5 Number of relations
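Software/hardware | Configuration | Software/hardware | Configuration |
---|---|---|---|
Operating system | Windows 10 | Programming language | Python 3.7 |
Memory | 32 GB | Deep learning framework | PyTorch 1.7.1 |
GPU | NVIDIA GTX 1660 | | |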
Tab. 6 Experimental environment configuration
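Parameter | Meaning | Value |
---|---|---|
Batch_size | Number of samples per batch | 8 |
Epoch_num | Number of training epochs | 10 |
Hidden_size | Hidden layer dimension of the LSTM | 200 |
Word_vector_dim | Dimension of word vectors computed by the BERT model | 768 |
Phrase_vector_dim | Dimension of phrase vectors computed by the Word2Vec model | 200 |
Dropout_rate | Dropout rate | 0.5 |
Learning_rate | Learning rate | |
Sliding_window | Sliding window size | 5 |
Bootstrapping_num | Total number of iterations | 8 |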
Tab. 7 Model parameter setting
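
For anyone reproducing the setup, the settings in Tab. 7 could be collected in a single configuration object. The sketch below is one possible layout, not the authors' code: `ModelConfig` is a hypothetical name, and the learning rate is left as `None` because its value is missing from the table as reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters as listed in Tab. 7."""
    batch_size: int = 8
    epoch_num: int = 10
    hidden_size: int = 200          # LSTM hidden layer dimension
    word_vector_dim: int = 768      # BERT word vectors
    phrase_vector_dim: int = 200    # Word2Vec phrase vectors
    dropout_rate: float = 0.5
    learning_rate: Optional[float] = None  # value not present in Tab. 7
    sliding_window: int = 5
    bootstrapping_num: int = 8      # total bootstrapping iterations


config = ModelConfig()
```

Freezing the dataclass keeps the settings read-only, so one experiment run cannot silently change a value that another part of the code relies on.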
Algorithm | Sample size | Recall | Precision | F1 score |
---|---|---|---|---|
Bootstrapping + pattern matching[ | Few samples | 0.48 | 0.84 | 0.61 |
Watson Knowledge Studio[ | Few samples | 0.63 | 0.74 | 0.68 |
RDF-CRF[ | Few samples | 0.78 | 0.86 | 0.82 |
NER-IBS | Few samples | 0.81 | 0.87 | 0.84 |
BiLSTM-CRF[ | Large-scale samples | 0.83 | 0.90 | 0.87 |
BiLSTM-Attention-CRF + dictionary[ | Large-scale samples | 0.87 | 0.90 | 0.88 |
Tab. 8 Comparison of experimental results of different algorithms
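
The F1 column in Tab. 8 can be related to the other two columns through the usual harmonic mean of precision and recall. The sketch below assumes that standard definition, which the page itself does not spell out; `f1_score` is a name chosen here. Because the tabulated precision and recall are already rounded, and the paper may aggregate per class, recomputed values for other rows can differ in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NER-IBS row of Tab. 8: recall 0.81, precision 0.87.
ner_ibs_f1 = f1_score(precision=0.87, recall=0.81)
print(round(ner_ibs_f1, 2))  # 0.84
```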
Type | Recall | Precision | F1 score |
---|---|---|---|
Without entity evaluation model | 0.85 | 0.83 | 0.81 |
With entity evaluation model | 0.81 | 0.87 | 0.84 |
Tab. 9 Ablation experimental results
Type | Recall | Precision | F1 score |
---|---|---|---|
Extraction without categories | 0.98 | 0.91 | 0.94 |
Relation classification | 0.74 | 0.70 | 0.71 |
Tab. 10 Experimental results of relation extraction and classification
1 | LI J H. Overview of the technologies of threat intelligence sensing, sharing and analysis in cyber space[J]. Chinese Journal of Network and Information Security, 2016, 2(2): 16-29. 10.11959/j.issn.2096-109x.2016.00028 (in Chinese) |
2 | GRISHMAN R. Twenty-five years of information extraction[J]. Natural Language Engineering, 2019, 25(6): 677-692. 10.1017/s1351324919000512 |
3 | LI Z Y, SHEN Z R. Web information extraction study based on natural annotation[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(8): 853-859. 10.3772/j.issn.1000-0135.2013.08.008 (in Chinese) |
4 | NI X H. Design and implementation of electronic medical record information extraction system[D]. Nanjing: Southeast University, 2019: 30-34. 10.1109/itnec.2019.8729548 (in Chinese) |
5 | ZHOU G D, SU J. Named entity recognition using an HMM-based chunk tagger[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA: ACL, 2002: 473-480. 10.3115/1073083.1073163 |
6 | RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. [2022-02-05]. |
7 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171-4186. 10.18653/v1/n18-2 |
8 | ZHANG Q Y, FU L Y, WANG X B. Information extraction from scholar homepage based on BERT-BiLSTM-CRF[J]. Application Research of Computers, 2020, 37(S1): 47-49. (in Chinese) |
9 | WANG Y, ZHENG Y, YANG Q, et al. Hierarchical information extraction method based on joint sequence annotation[J]. Computer Applications and Software, 2021, 38(8): 167-174. 10.3969/j.issn.1000-386x.2021.08.026 (in Chinese) |
10 | LIU Z H, WINATA G I, FUNG P. Zero-resource cross-domain named entity recognition[C]// Proceedings of the 5th Workshop on Representation Learning for NLP. Stroudsburg, PA: ACL, 2020:1-6. 10.18653/v1/2020.repl4nlp-1.1 |
11 | LIAO X J, YUAN K, WANG X F, et al. Acing the IOC game: toward automatic discovery and analysis of open-source cyber threat intelligence[C]// Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2016: 755-766. 10.1145/2976749.2978315 |
12 | LONG Z, TAN L Z, ZHOU S P, et al. Collecting indicators of compromise from unstructured text of cybersecurity articles using neural-based sequence labelling[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. 10.1109/ijcnn.2019.8852142 |
13 | DIONÍSIO N, ALVES F, FERREIRA P M, et al. Cyberthreat detection from Twitter using deep neural networks[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. 10.1109/ijcnn.2019.8852475 |
14 | QIN Y, SHEN G W, ZHAO W B, et al. A network security entity recognition method based on feature template and CNN-BiLSTM-CRF[J]. Frontiers of Information Technology and Electronic Engineering, 2019, 20(6): 872-884. 10.1631/fitee.1800520 |
15 | GAO C, ZHANG X, LIU H. Data and knowledge-driven named entity recognition for cyber security[J]. Cybersecurity, 2021, 4: No.9. 10.1186/s42400-021-00072-y |
16 | McNEIL N, BRIDGES R A, IANNACONE M D, et al. PACE: pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts[C]// Proceedings of the 12th International Conference on Machine Learning and Applications. Piscataway: IEEE, 2013: 60-65. 10.1109/icmla.2013.106 |
17 | GEORGESCU T M, IANCU B, ZURINI M. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks[J]. Sensors, 2019, 19(15): No.3380. 10.3390/s19153380 |
18 | YI F, JIANG B, WANG L, et al. Cybersecurity named entity recognition using multi-modal ensemble learning[J]. IEEE Access, 2020, 8: 63214-63224. 10.1109/access.2020.2984582 |
19 | SHINYAMA Y. PDFMiner[EB/OL]. (2019-11-25) [2022-01-19]. |
20 | WARD J. HTMLParser[EB/OL]. (2013-03-01) [2022-02-10]. 10.36866/pn.91.14a |
21 | HE H, CHOI J D. The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 5555-5577. 10.18653/v1/2021.emnlp-main.451 |
22 | MILLER G A. WordNet: a lexical database for English[J]. Communications of the ACM, 1995, 38(11): 39-41. 10.1145/219717.219748 |
23 | OASIS Open. Introduction to STIX[EB/OL]. [2022-01-07]. |
24 | STROM B E, APPLEBAUM A, MILLER D P, et al. MITRE ATT&CK: design and philosophy[R/OL]. [2022-01-12]. |
25 | SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117. 10.1016/j.neunet.2014.09.003 |
26 | SUN J Y. Jieba[EB/OL]. (2020-01-20) [2022-03-12]. |
27 | BROWN P F, PIETRA V J D, DESOUZA P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467-479. |
28 | CHE W X, FENG Y L, QIN L B, et al. N-LTP: an open-source neural language technology platform for Chinese[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 42-49. 10.18653/v1/2021.emnlp-demo.6 |
29 | Anheng Information Center. Suspected new OceanLotus activity, with the target appearing to be a large domestic enterprise (2019-11-13)[EB/OL]. (2021-01-08) [2022-03-12]. 10.1017/9781009207898.003 (in Chinese) |