Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling

Journal of Computer Applications, 2023, Vol. 43, Issue (5): 1445-1453. DOI: 10.11772/j.issn.1001-9081.2022040551

Shunhang CHENG, Zhihua LI, Tao WEI

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China

Received: 2022-04-25 Revised: 2022-06-30 Accepted: 2022-07-01 Online: 2022-08-05 Published: 2023-05-10
Contact: Zhihua LI
About the authors: CHENG Shunhang, born in 1998 in Jingmen, Hubei, M. S. candidate. His research interests include natural language processing and information security.
LI Zhihua, born in 1969 in Baojing, Hunan, Ph. D., professor. His research interests include key technologies of end-edge-cloud computing and information security, and their interdisciplinary research with frontier disciplines such as artificial intelligence. E-mail: jswxzhli@aliyun.com
WEI Tao, born in 1998 in Xiangyang, Hubei, M. S. candidate. His research interests include information system analysis and information system security.
Supported by:
Intelligent Manufacturing Program of the Ministry of Industry and Information Technology (ZH-XZ-180004); Fundamental Research Funds for the Central Universities (JUSRP211A41)

Abstract

To efficiently and automatically mine threat intelligence entities and their relations from open-source heterogeneous big data, a Threat Intelligence Entity Relation Extraction (TIERE) method is proposed. First, a data preprocessing method is presented, based on an analysis of the characteristics of open-source cyber security reports. Then, to address the high text complexity and the scarcity of standard datasets in the cyber security field, an Improved BootStrapping-based Named Entity Recognition (NER-IBS) algorithm and a Semantic Role Labeling-based Relation Extraction (RE-SRL) algorithm are proposed. Initial seeds are constructed from a small number of samples and rules, entities in unstructured text are mined through iterative training, and relations between entities are mined by a strategy of constructing semantic roles. Experimental results show that, on the few-shot cyber security information extraction dataset, the F1 value of the NER-IBS algorithm is 84%, which is 2 percentage points higher than that of the RDF-CRF (Regular expression and Dictionary combined with Feature templates as well as Conditional Random Field) algorithm, and the F1 value of the RE-SRL algorithm for uncategorized relation extraction is 94%, indicating that the TIERE method extracts entities and relations effectively.

Key words: entity recognition, relation extraction, threat intelligence, bootstrapping, semantic role labeling
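The seed-and-iterate idea described in the abstract can be illustrated with a toy sketch of generic bootstrapping. This is not the NER-IBS algorithm itself: the function names, the token-window contexts, and the `min_count` threshold (a crude stand-in for an entity evaluation step) are all assumptions made for illustration.

```python
from collections import Counter


def context_of(tokens, i, window):
    """Return the (left, right) token windows around position i."""
    left = tuple(tokens[max(0, i - window):i])
    right = tuple(tokens[i + 1:i + 1 + window])
    return left, right


def bootstrap_entities(sentences, seeds, iterations=8, window=1, min_count=2):
    """Grow an entity set from seed terms by reusing the contexts they occur in."""
    entities = set(seeds)
    for _ in range(iterations):
        # 1. Collect the contexts in which known entities appear.
        contexts = {
            context_of(tokens, i, window)
            for tokens in sentences
            for i, tok in enumerate(tokens)
            if tok in entities
        }
        # 2. Count unknown tokens that appear in one of those contexts.
        candidates = Counter(
            tok
            for tokens in sentences
            for i, tok in enumerate(tokens)
            if tok not in entities and context_of(tokens, i, window) in contexts
        )
        # 3. Accept only candidates seen often enough (hypothetical filter).
        accepted = {tok for tok, n in candidates.items() if n >= min_count}
        if not accepted:
            break  # nothing new was learned, so stop early
        entities |= accepted
    return entities


sentences = [
    ["attacker", "uses", "Mirai", "malware"],
    ["attacker", "uses", "Moze", "malware"],
    ["group", "uses", "Moze", "malware"],
    ["report", "mentions", "China", "today"],
]
found = bootstrap_entities(sentences, seeds={"Mirai"})
```

Here `Moze` is picked up because it shares the `uses ... malware` context with the seed `Mirai` in two sentences, while `China` is left out because it never occurs in a learned context.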

CLC number:

TP301.6

Shunhang CHENG, Zhihua LI, Tao WEI. Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling[J]. Journal of Computer Applications, 2023, 43(5): 1445-1453.

Figures and tables (15)

Fig. 1 Data preprocessing method

Tab. 1 Regular expressions of specialized vocabulary

Specialized vocabulary	Regular expression
URL	(s?[hf]t?tps?%3A%2F%2F\w[\w%-]*?)(?:[^\w%-]|$)
IP	(?:^|(?![^\d\.]))(?:(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])[\[\(\\]?\.[\]\)]?){3}(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])
MD5	([a-f0-9]{32}|[A-F0-9]{32})
SHA1	([a-f0-9]{40}|[A-F0-9]{40})
SHA256	([a-f0-9]{64}|[A-F0-9]{64})
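As a minimal sketch of how the Table 1 patterns might be applied, the snippet below classifies a single token by testing each pattern against the whole token. The patterns are transcribed with full-width characters normalized to ASCII and the table-escaped `\|` read as the alternation `|`, which is an assumption about the original regular expressions. The `classify_token` helper is hypothetical.

```python
import re

# Patterns transcribed from Table 1 (assumed ASCII normalization).
PATTERNS = {
    "URL": r"(s?[hf]t?tps?%3A%2F%2F\w[\w%-]*?)(?:[^\w%-]|$)",
    "IP": (
        r"(?:^|(?![^\d\.]))"
        r"(?:(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])[\[\(\\]?\.[\]\)]?){3}"
        r"(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])"
    ),
    "MD5": r"([a-f0-9]{32}|[A-F0-9]{32})",
    "SHA1": r"([a-f0-9]{40}|[A-F0-9]{40})",
    "SHA256": r"([a-f0-9]{64}|[A-F0-9]{64})",
}


def classify_token(token):
    """Return the first category whose pattern matches the whole token, else None."""
    for name, pattern in PATTERNS.items():
        # fullmatch avoids labeling the first 32 characters of a SHA256 as an MD5.
        if re.fullmatch(pattern, token):
            return name
    return None


label = classify_token("77.245.76[.]66")  # defanged IP, as in the IoC samples
```

Whole-token matching is a design choice of this sketch: the hash patterns have no boundaries of their own, so searching inside running text would let a shorter hash pattern match a prefix of a longer digest. Mixed-case hex digests are not matched by these patterns.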

Fig. 2 Improvement strategy for the Bootstrapping algorithm

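Tab. 2 Entity types and pattern matching methods

Entity type	Pattern matching method	Entity type	Pattern matching method
Attacker	Dictionary + regular expression	Location	Dictionary
Malware	Dictionary + regular expression	Type	Dictionary
Cve	Regular expression	IoC	Regular expression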

Fig. 3 Entity evaluation model example

Fig. 4 Logic architecture of the RE-SRL algorithm

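Tab. 3 Relation categories

Agent	Patient	Relation set
Attacker	Malware	uses
Malware	Type	belongs to
Malware	Malware	related to, downloads, controls
Attacker, Malware	Location	located in, targets
Attacker, Malware	Cve	exploits
Attacker, Malware	IoC	points to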
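The relation categories of Table 3 can be read as a type schema that constrains which triples are valid. The sketch below is an illustration under assumptions, not the RE-SRL algorithm: relation names are rendered in English, the blank agent cells in the Cve and IoC rows are assumed to inherit "Attacker, Malware" (as the relation types in Table 5 suggest), and `frame_to_triple` with its `entity_types` argument is a hypothetical helper that takes an already extracted agent, predicate, and patient.

```python
# (agent type, patient type) -> allowed relations, transcribed from Table 3.
RELATION_SCHEMA = {
    ("Attacker", "Malware"): {"uses"},
    ("Malware", "Type"): {"belongs to"},
    ("Malware", "Malware"): {"related to", "downloads", "controls"},
    ("Attacker", "Location"): {"located in", "targets"},
    ("Malware", "Location"): {"located in", "targets"},
    ("Attacker", "Cve"): {"exploits"},   # agent cell assumed from merged row
    ("Malware", "Cve"): {"exploits"},    # agent cell assumed from merged row
    ("Attacker", "IoC"): {"points to"},  # agent cell assumed from merged row
    ("Malware", "IoC"): {"points to"},   # agent cell assumed from merged row
}


def frame_to_triple(agent, predicate, patient, entity_types):
    """Return (agent, predicate, patient) if the schema allows it, else None.

    entity_types maps an entity mention to its type (hypothetical NER output).
    """
    key = (entity_types.get(agent), entity_types.get(patient))
    if predicate in RELATION_SCHEMA.get(key, set()):
        return (agent, predicate, patient)
    return None


types = {"OceanLotus": "Attacker", "Mirai": "Malware", "China": "Location"}
triple = frame_to_triple("OceanLotus", "uses", "Mirai", types)
```

A frame whose argument types are reversed, or whose mentions were not recognized as entities, is rejected rather than guessed.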

Tab. 4 Number of labels per entity

Entity label type	Examples	Count
Attacker	APT34, OceanLotus	598
Malware	Mirai, Moze	1,057
Cve	EternalBlue	282
Location	China, Southeast Asia	582
Type	virus, Trojan	417
IoC	77.245.76[.]66	1,638

Tab. 5 Number of relations

Relation type	Example	Count
(Attacker, Malware)	(OceanLotus, uses, Mirai)	96
(Attacker, Location)	(OceanLotus, targets, China)	212
(Malware, Location)	(Mirai, targets, China)	58
(Attacker, Cve)	(OceanLotus, exploits, EternalBlue)	24
(Malware, Cve)	(Mirai, exploits, EternalBlue)	12
(Malware, Type)	(Mirai, belongs to, Trojan)	174
(Malware, Malware)	(Mirai, related to, Moze)	541
(Attacker, IoC)	(OceanLotus, points to, 77.245.76[.]66)	812
(Malware, IoC)	(Mirai, points to, 77.245.76[.]66)	1,031

Tab. 6 Experimental environment configuration

Software/Hardware	Configuration	Software/Hardware	Configuration
System	Windows 10	Development language	Python 3.7
Memory	32 GB	Deep learning framework	PyTorch 1.7.1
GPU	NVIDIA GTX 1660

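Tab. 7 Model parameter settings

Parameter	Meaning	Value
Batch_size	Batch size	8
Epoch_num	Number of training epochs	10
Hidden_size	Hidden layer dimension of the LSTM	200
Word_vector_dim	Dimension of word vectors computed by the BERT model	768
Phrase_vector_dim	Dimension of phrase vectors computed by the Word2Vec model	200
Dropout_rate	Dropout rate	0.5
Learning_rate	Learning rate	$10^{-5}$
Sliding_window	Sliding window size	5
Bootstrapping_num	Total number of bootstrapping iterations	8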
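For reuse in code, the Table 7 settings could be collected in one immutable configuration object. This is only a sketch: the `ModelConfig` class and its lowercase field names are hypothetical, and the learning rate is read as 10 to the power of -5, which is an assumption about the garbled table cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters transcribed from Table 7 (field names are assumptions)."""
    batch_size: int = 8
    epoch_num: int = 10
    hidden_size: int = 200         # LSTM hidden layer dimension
    word_vector_dim: int = 768     # BERT word vector dimension
    phrase_vector_dim: int = 200   # Word2Vec phrase vector dimension
    dropout_rate: float = 0.5
    learning_rate: float = 1e-5    # assumed reading of the table value
    sliding_window: int = 5
    bootstrapping_num: int = 8     # total bootstrapping iterations


config = ModelConfig()
```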

Tab. 8 Comparison of experimental results of different algorithms

Algorithm	Sample scale	Recall	Precision	F1
Bootstrapping + pattern matching [16]	Few-shot	0.48	0.84	0.61
Watson Knowledge Studio [17]	Few-shot	0.63	0.74	0.68
RDF-CRF [18]	Few-shot	0.78	0.86	0.82
NER-IBS	Few-shot	0.81	0.87	0.84
BiLSTM-CRF [13]	Large-scale	0.83	0.90	0.87
BiLSTM-Attention-CRF + dictionary [15]	Large-scale	0.87	0.90	0.88
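The page does not state how F1 is computed. Assuming the standard definition, the harmonic mean of precision and recall, the relationship between the three metric columns can be sketched as follows (`f1_score` is a hypothetical helper name).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NER-IBS row of Table 8: precision 0.87, recall 0.81.
ner_ibs_f1 = round(f1_score(0.87, 0.81), 2)
```

Because the table reports precision and recall rounded to two decimals, recomputing F1 from them may not reproduce every reported value exactly.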

Fig. 5 Experimental results of each entity under different iteration numbers

Tab. 9 Ablation experimental results

Type	Recall	Precision	F1
Without entity evaluation model	0.85	0.83	0.81
With entity evaluation model	0.81	0.87	0.84

Tab. 10 Experimental results of relation extraction and classification

Type	Recall	Precision	F1
Uncategorized extraction	0.98	0.91	0.94
Relation classification	0.74	0.70	0.71

References (29)

1	LI J H. Overview of the technologies of threat intelligence sensing, sharing and analysis in cyber space[J]. Chinese Journal of Network and Information Security, 2016, 2(2): 16-29. DOI: 10.11959/j.issn.2096-109x.2016.00028
2	GRISHMAN R. Twenty-five years of information extraction[J]. Natural Language Engineering, 2019, 25(6): 677-692. DOI: 10.1017/s1351324919000512
3	LI Z Y, SHEN Z R. Web information extraction study based on natural annotation[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(8): 853-859. DOI: 10.3772/j.issn.1000-0135.2013.08.008
4	NI X H. Design and implementation of electronic medical record information extraction system[D]. Nanjing: Southeast University, 2019: 30-34.
5	ZHOU G D, SU J. Named entity recognition using an HMM-based chunk tagger[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA: ACL, 2002: 473-480. DOI: 10.3115/1073083.1073163
6	RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. [2022-02-05].
7	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171-4186. DOI: 10.18653/v1/N19-1423
8	ZHANG Q Y, FU L Y, WANG X B. Information extraction from scholar homepage based on BERT-BiLSTM-CRF[J]. Application Research of Computers, 2020, 37(S1): 47-49.
9	WANG Y, ZHENG Y, YANG Q, et al. Hierarchical information extraction method based on joint sequence annotation[J]. Computer Applications and Software, 2021, 38(8): 167-174. DOI: 10.3969/j.issn.1000-386x.2021.08.026
10	LIU Z H, WINATA G I, FUNG P. Zero-resource cross-domain named entity recognition[C]// Proceedings of the 5th Workshop on Representation Learning for NLP. Stroudsburg, PA: ACL, 2020: 1-6. DOI: 10.18653/v1/2020.repl4nlp-1.1
11	LIAO X J, YUAN K, WANG X F, et al. Acing the IOC game: toward automatic discovery and analysis of open-source cyber threat intelligence[C]// Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2016: 755-766. DOI: 10.1145/2976749.2978315
12	LONG Z, TAN L Z, ZHOU S P, et al. Collecting indicators of compromise from unstructured text of cybersecurity articles using neural-based sequence labelling[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. DOI: 10.1109/ijcnn.2019.8852142
13	DIONÍSIO N, ALVES F, FERREIRA P M, et al. Cyberthreat detection from Twitter using deep neural networks[C]// Proceedings of the 2019 International Joint Conference on Neural Networks. Piscataway: IEEE, 2019: 1-8. DOI: 10.1109/ijcnn.2019.8852475
14	QIN Y, SHEN G W, ZHAO W B, et al. A network security entity recognition method based on feature template and CNN-BiLSTM-CRF[J]. Frontiers of Information Technology and Electronic Engineering, 2019, 20(6): 872-884. DOI: 10.1631/fitee.1800520
15	GAO C, ZHANG X, LIU H. Data and knowledge-driven named entity recognition for cyber security[J]. Cybersecurity, 2021, 4: No.9. DOI: 10.1186/s42400-021-00072-y
16	McNEIL N, BRIDGES R A, IANNACONE M D, et al. PACE: pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts[C]// Proceedings of the 12th International Conference on Machine Learning and Applications. Piscataway: IEEE, 2013: 60-65. DOI: 10.1109/icmla.2013.106
17	GEORGESCU T M, IANCU B, ZURINI M. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks[J]. Sensors, 2019, 19(15): No.3380. DOI: 10.3390/s19153380
18	YI F, JIANG B, WANG L, et al. Cybersecurity named entity recognition using multi-modal ensemble learning[J]. IEEE Access, 2020, 8: 63214-63224. DOI: 10.1109/access.2020.2984582
19	SHINYAMA Y. PDFMiner[EB/OL]. (2019-11-25)[2022-01-19].
20	WARD J. HTMLParser[EB/OL]. (2013-03-01)[2022-02-10].
21	HE H, CHOI J D. The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 5555-5577. DOI: 10.18653/v1/2021.emnlp-main.451
22	MILLER G A. WordNet: a lexical database for English[J]. Communications of the ACM, 1995, 38(11): 39-41. DOI: 10.1145/219717.219748
23	OASIS Open. Introduction to STIX[EB/OL]. [2022-01-07].
24	STROM B E, APPLEBAUM A, MILLER D P, et al. MITRE ATT&CK: design and philosophy[R/OL]. [2022-01-12].
25	SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117. DOI: 10.1016/j.neunet.2014.09.003
26	SUN J Y. Jieba[EB/OL]. (2020-01-20)[2022-03-12].
27	BROWN P F, PIETRA V J D, DESOUZA P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467-479.
28	CHE W X, FENG Y L, QIN L B, et al. N-LTP: an open-source neural language technology platform for Chinese[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 42-49. DOI: 10.18653/v1/2021.emnlp-demo.6
29	Anheng Information Center. Suspected new activity of OceanLotus, and the target seems to be a large domestic enterprise (2019-11-13)[EB/OL]. (2021-01-08)[2022-03-12].

