Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling

doi:10.11772/j.issn.1001-9081.2022040551

Journal of Computer Applications, 2023, Vol. 43, Issue (5): 1445-1453. DOI: 10.11772/j.issn.1001-9081.2022040551

Special Issue: Artificial Intelligence

Shunhang CHENG, Zhihua LI, Tao WEI

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China

Received: 2022-04-25; Revised: 2022-06-30; Accepted: 2022-07-01; Online: 2022-08-05; Published: 2023-05-10
Contact: Zhihua LI
About the authors:
CHENG Shunhang, born in 1998, M. S. candidate. His research interests include natural language processing and information security.
LI Zhihua, born in 1969, Ph. D., professor. His research interests include key technologies of end-edge-cloud computing and information security, and their interdisciplinary research with frontier disciplines such as artificial intelligence.
WEI Tao, born in 1998, M. S. candidate. His research interests include information system analysis and information system security.
Supported by: Intelligent Manufacturing Program of Ministry of Industry and Information Technology (ZH-XZ-180004); Fundamental Research Funds for Central Universities (JUSRP211A41)

Corresponding author e-mail: jswxzhli@aliyun.com
Native places: CHENG Shunhang, Jingmen, Hubei; LI Zhihua, Baojing, Hunan; WEI Tao, Xiangyang, Hubei.

Abstract

To mine threat intelligence entities and their relations from open-source heterogeneous big data efficiently and automatically, a Threat Intelligence Entity Relation Extraction (TIERE) method was proposed. Firstly, a data preprocessing method was designed by analyzing the characteristics of open-source cyber security reports. Then, to address the high text complexity and the scarcity of standard datasets in the cyber security field, an Improved BootStrapping-based Named Entity Recognition (NER-IBS) algorithm and a Semantic Role Labeling-based Relation Extraction (RE-SRL) algorithm were developed. Initial seeds were constructed from a small number of samples and rules, entities in unstructured text were mined through iterative training, and relations between entities were mined by constructing semantic roles. Experimental results show that, on a few-shot cyber security information extraction dataset, the NER-IBS algorithm achieves an F1 value of 84%, 2 percentage points higher than that of the RDF-CRF (Regular expression and Dictionary combined with Feature templates as well as Conditional Random Field) algorithm, and the RE-SRL algorithm achieves an F1 value of 94% for uncategorized relation extraction, indicating that the TIERE method extracts entities and relations effectively.
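
The seed-and-iterate idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the NER-IBS algorithm: the function name `bootstrap_entities`, the "previous word" context pattern, and the `min_pattern_hits` threshold are all assumptions made for illustration, and the entity evaluation model listed among the figures (Fig. 3) is omitted.

```python
from collections import Counter


def bootstrap_entities(sentences, seeds, iterations=2, min_pattern_hits=2):
    """Toy bootstrapping loop (hypothetical, not the paper's NER-IBS).

    Learns which words tend to precede known entities, then promotes
    tokens that follow those words to new entities.
    """
    entities = set(seeds)
    for _ in range(iterations):
        # Step 1: count the word immediately before each known entity.
        contexts = Counter()
        for sent in sentences:
            tokens = sent.split()
            for i in range(1, len(tokens)):
                if tokens[i] in entities:
                    contexts[tokens[i - 1]] += 1
        # Keep only context words seen often enough to be trusted.
        patterns = {w for w, c in contexts.items() if c >= min_pattern_hits}

        # Step 2: tokens following a trusted context word become candidates.
        new_entities = set()
        for sent in sentences:
            tokens = sent.split()
            for i in range(1, len(tokens)):
                if tokens[i - 1] in patterns and tokens[i] not in entities:
                    new_entities.add(tokens[i])

        if not new_entities:
            break  # converged: no pattern produced anything new
        entities |= new_entities
    return entities


corpus = [
    "the trojan Mirai spread quickly",
    "the trojan Emotet returned",
    "a trojan Mozi appeared",
]
print(sorted(bootstrap_entities(corpus, {"Mirai", "Emotet"})))
```

Without a filtering step, every token that follows a trusted context word is accepted, which is how plain bootstrapping drifts semantically; a candidate-scoring stage would slot in before `entities |= new_entities`.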

Key words: entity recognition, relation extraction, threat intelligence, bootstrapping, semantic role labeling

CLC Number: TP301.6

Shunhang CHENG, Zhihua LI, Tao WEI. Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling[J]. Journal of Computer Applications, 2023, 43(5): 1445-1453.

Figures/Tables 15

Fig. 1 Data preprocessing method

Tab. 1 Regular expressions of specialized vocabulary

Specialized vocabulary	Regular expression
URL	(s?[hf]t?tps?%3A%2F%2F\w[\w%-]*?)(?:[^\w%-]|$)
IP	(?:^|(?![^\d\.]))(?:(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])[\[\(\\]?\.[\]\)]?){3}(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])
MD5	([a-f0-9]{32}|[A-F0-9]{32})
SHA1	([a-f0-9]{40}|[A-F0-9]{40})
SHA256	([a-f0-9]{64}|[A-F0-9]{64})
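
As a rough illustration of how the expressions in Tab. 1 might be applied, the sketch below compiles the IP and hash patterns with Python's `re` module and runs them over a sample sentence. The `\b` word boundaries around the hash patterns are an added assumption (without them a SHA256 digest would also match as an MD5 prefix), and the names `IOC_PATTERNS`, `extract_iocs` and `refang` are hypothetical helpers, not part of the paper.

```python
import re

# IP pattern transcribed from Tab. 1; it tolerates "defanged" dots such as [.]
IP_PATTERN = (
    r"(?:^|(?![^\d\.]))"
    r"(?:(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])[\[\(\\]?\.[\]\)]?){3}"
    r"(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])"
)

IOC_PATTERNS = {
    "IP": re.compile(IP_PATTERN),
    # \b anchors are an assumption added here, not part of Tab. 1.
    "MD5": re.compile(r"\b([a-f0-9]{32}|[A-F0-9]{32})\b"),
    "SHA1": re.compile(r"\b([a-f0-9]{40}|[A-F0-9]{40})\b"),
    "SHA256": re.compile(r"\b([a-f0-9]{64}|[A-F0-9]{64})\b"),
}


def extract_iocs(text):
    """Return every match of each IoC pattern, keyed by IoC kind."""
    return {
        kind: [m.group(0) for m in pattern.finditer(text)]
        for kind, pattern in IOC_PATTERNS.items()
    }


def refang(ip):
    """Strip the bracket/backslash decoration from a defanged IP address."""
    return re.sub(r"[\[\]\(\)\\]", "", ip)


sample = "C2 at 77.245.76[.]66 dropped d41d8cd98f00b204e9800998ecf8427e"
iocs = extract_iocs(sample)
print(iocs["IP"], [refang(ip) for ip in iocs["IP"]])
print(iocs["MD5"])
```

Because the first octet alternative `[1-9]?\d` is tried first, the IP pattern can stop early on a three-digit final octet; a production extractor would likely append a trailing boundary check.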

Fig. 2 Improved strategy of Bootstrapping algorithm

Tab. 2 Entity types and pattern matching methods

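Entity type	Pattern matching method	Entity type	Pattern matching method
Attacker	Dictionary + regular expression	Location	Dictionary
Malware	Dictionary + regular expression	Type	Dictionary
Cve	Regular expression	IoC	Regular expression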

Fig. 3 Entity evaluation model example

Fig. 4 Logic architecture of RE-SRL algorithm

Tab. 3 Relation category

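Agent	Patient	Relation set
Attacker	Malware	uses
Malware	Type	belongs to
Malware	Malware	related to, downloads, controls
Attacker, Malware	Location	located in, targets
Attacker, Malware	Cve	exploits
Attacker, Malware	IoC	points to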
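
One way to read Tab. 3 is as a schema that constrains which relation may hold between an agent type and a patient type. The sketch below encodes it as a lookup table; the English relation names are glosses of the table's labels, and `RELATION_SCHEMA` and `is_valid_triple` are hypothetical names introduced only for this example.

```python
# (agent type, patient type) -> relations permitted between them, after Tab. 3.
RELATION_SCHEMA = {
    ("Attacker", "Malware"): {"uses"},
    ("Malware", "Type"): {"belongs to"},
    ("Malware", "Malware"): {"related to", "downloads", "controls"},
    ("Attacker", "Location"): {"located in", "targets"},
    ("Malware", "Location"): {"located in", "targets"},
    ("Attacker", "Cve"): {"exploits"},
    ("Malware", "Cve"): {"exploits"},
    ("Attacker", "IoC"): {"points to"},
    ("Malware", "IoC"): {"points to"},
}


def is_valid_triple(agent_type, relation, patient_type):
    """Check a candidate (agent, relation, patient) triple against the schema."""
    allowed = RELATION_SCHEMA.get((agent_type, patient_type), set())
    return relation in allowed


print(is_valid_triple("Attacker", "uses", "Malware"))   # True
print(is_valid_triple("Malware", "uses", "Attacker"))   # False
```

A check of this kind could be used to discard candidate triples whose entity types cannot stand in the proposed relation before any classifier is consulted.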

Tab. 4 Number of labels per entity

Entity label type	Examples	Count
Attacker	APT34, OceanLotus	598
Malware	Mirai, Moze	1,057
Cve	EternalBlue	282
Location	China, Southeast Asia	582
Type	virus, Trojan	417
IoC	77.245.76[.]66	1,638

Tab. 5 Number of relations

Relation type	Example	Count
(Attacker, Malware)	(OceanLotus, uses, Mirai)	96
(Attacker, Location)	(OceanLotus, targets, China)	212
(Malware, Location)	(Mirai, targets, China)	58
(Attacker, Cve)	(OceanLotus, exploits, EternalBlue)	24
(Malware, Cve)	(Mirai, exploits, EternalBlue)	12
(Malware, Type)	(Mirai, belongs to, Trojan)	174
(Malware, Malware)	(Mirai, related to, Moze)	541
(Attacker, IoC)	(OceanLotus, points to, 77.245.76[.]66)	812
(Malware, IoC)	(Mirai, points to, 77.245.76[.]66)	1,031

Tab. 6 Experimental environment configuration

Software/Hardware	Configuration	Software/Hardware	Configuration
System	Windows 10	Programming language	Python 3.7
Memory	32 GB	Deep learning framework	PyTorch 1.7.1
GPU	NVIDIA GTX 1660

Tab. 7 Model parameter setting

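Parameter	Meaning	Value
Batch_size	Batch size	8
Epoch_num	Number of training epochs	10
Hidden_size	Hidden layer dimension of the LSTM	200
Word_vector_dim	Dimension of word vectors computed by the BERT model	768
Phrase_vector_dim	Dimension of phrase vectors computed by the Word2Vec model	200
Dropout_rate	Dropout rate	0.5
Learning_rate	Learning rate	$10^{-5}$
Sliding_window	Sliding window size	5
Bootstrapping_num	Total number of iterations	8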

Tab. 8 Comparison of experimental results of different algorithms

Algorithm	Sample size	Recall	Precision	F1 value
Bootstrapping + pattern matching [16]	Few-shot	0.48	0.84	0.61
Watson Knowledge Studio [17]	Few-shot	0.63	0.74	0.68
RDF-CRF [18]	Few-shot	0.78	0.86	0.82
NER-IBS	Few-shot	0.81	0.87	0.84
BiLSTM-CRF [13]	Large-scale	0.83	0.90	0.87
BiLSTM-Attention-CRF + dictionary [15]	Large-scale	0.87	0.90	0.88
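
The F1 value is the harmonic mean of precision and recall, $F1 = 2PR/(P+R)$. As a small sanity check, the sketch below recomputes it for the RDF-CRF and NER-IBS rows of Tab. 8; the helper name `f1_score` is an assumption, and the inputs are the rounded figures from the table, so the last digit of other rows may differ from the published value.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in Tab. 8.
reported = {
    "RDF-CRF": (0.86, 0.78),
    "NER-IBS": (0.87, 0.81),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# RDF-CRF: F1 = 0.82
# NER-IBS: F1 = 0.84
```

The two results differ by 0.02, which matches the 2-percentage-point gain stated in the abstract.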

Fig. 5 Experimental results of each entity under different iteration numbers

Tab. 9 Ablation experimental results

Type	Recall	Precision	F1 value
Without entity evaluation model	0.85	0.83	0.81
With entity evaluation model	0.81	0.87	0.84

Tab. 10 Experimental results of relation extraction and classification

Type	Recall	Precision	F1 value
Uncategorized extraction	0.98	0.91	0.94
Relation classification	0.74	0.70	0.71

References 29

1	LI J H. Overview of the technologies of threat intelligence sensing, sharing and analysis in cyber space[J]. Chinese Journal of Network and Information Security, 2016, 2(2): 16-29. 10.11959/j.issn.2096-109x.2016.00028
2	GRISHMAN R. Twenty-five years of information extraction[J]. Natural Language Engineering, 2019, 25(6): 677-692. 10.1017/s1351324919000512
3	LI Z Y, SHEN Z R. Web information extraction study based on natural annotation[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(8): 853-859. 10.3772/j.issn.1000-0135.2013.08.008
4	NI X H. Design and implementation of electronic medical record information extraction system[D]. Nanjing: Southeast University, 2019: 30-34. 10.1109/itnec.2019.8729548
5	ZHOU G D, SU J. Named entity recognition using an HMM-based chunk tagger[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA: ACL, 2002: 473-480. 10.3115/1073083.1073163
6	RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. [2022-02-05].
7	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171-4186. 10.18653/v1/n18-2
8	ZHANG Q Y, FU L Y, WANG X B. Information extraction from scholar homepage based on BERT-BiLSTM-CRF[J]. Application Research of Computers, 2020, 37(S1): 47-49.
9	WANG Y, ZHENG Y, YANG Q, et al. Hierarchical information extraction method based on joint sequence annotation[J]. Computer Applications and Software, 2021, 38(8): 167-174. 10.3969/j.issn.1000-386x.2021.08.026
10	LIU Z H, WINATA G I, FUNG P. Zero-resource cross-domain named entity recognition[C]// Proceedings of the 5th Workshop on Representation Learning for NLP. Stroudsburg, PA: ACL, 2020: 1-6. 10.18653/v1/2020.repl4nlp-1.1
11	LIAO X J, YUAN K, WANG X F, et al. Acing the IOC game: toward automatic discovery and analysis of open-source cyber threat intelligence[C]// Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2016: 755-766. 10.1145/2976749.2978315
12	LONG Z, TAN L Z, ZHOU S P, et al. Collecting indicators of compromise from unstructured text of cybersecurity articles using neural-based sequence labelling[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. 10.1109/ijcnn.2019.8852142
13	DIONÍSIO N, ALVES F, FERREIRA P M, et al. Cyberthreat detection from Twitter using deep neural networks[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. 10.1109/ijcnn.2019.8852475
14	QIN Y, SHEN G W, ZHAO W B, et al. A network security entity recognition method based on feature template and CNN-BiLSTM-CRF[J]. Frontiers of Information Technology and Electronic Engineering, 2019, 20(6): 872-884. 10.1631/fitee.1800520
15	GAO C, ZHANG X, LIU H. Data and knowledge-driven named entity recognition for cyber security[J]. Cybersecurity, 2021, 4: No.9. 10.1186/s42400-021-00072-y
16	McNEIL N, BRIDGES R A, IANNACONE M D, et al. PACE: pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts[C]// Proceedings of the 12th International Conference on Machine Learning and Applications. Piscataway: IEEE, 2013: 60-65. 10.1109/icmla.2013.106
17	GEORGESCU T M, IANCU B, ZURINI M. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks[J]. Sensors, 2019, 19(15): No.3380. 10.3390/s19153380
18	YI F, JIANG B, WANG L, et al. Cybersecurity named entity recognition using multi-modal ensemble learning[J]. IEEE Access, 2020, 8: 63214-63224. 10.1109/access.2020.2984582
19	SHINYAMA Y. PDFMiner[EB/OL]. (2019-11-25)[2022-01-19].
20	WARD J. HTMLParser[EB/OL]. (2013-03-01)[2022-02-10]. 10.36866/pn.91.14a
21	HE H, CHOI J D. The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 5555-5577. 10.18653/v1/2021.emnlp-main.451
22	MILLER G A. WordNet: a lexical database for English[J]. Communications of the ACM, 1995, 38(11): 39-41. 10.1145/219717.219748
23	OASIS Open. Introduction to STIX[EB/OL]. [2022-01-07].
24	STROM B E, APPLEBAUM A, MILLER D P, et al. MITRE ATT&CK: design and philosophy[R/OL]. [2022-01-12].
25	SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117. 10.1016/j.neunet.2014.09.003
26	SUN J Y. Jieba[EB/OL]. (2020-01-20)[2022-03-12].
27	BROWN P F, PIETRA V J D, DESOUZA P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467-479.
28	CHE W X, FENG Y L, QIN L B, et al. N-LTP: an open-source neural language technology platform for Chinese[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 42-49. 10.18653/v1/2021.emnlp-demo.6
29	Anheng Information Center. Suspected a new activity of OceanLotus, and the target seems to be a large domestic enterprise (2019-11-13)[EB/OL]. (2021-01-08)[2022-03-12]. 10.1017/9781009207898.003
