Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (5): 1445-1453. DOI: 10.11772/j.issn.1001-9081.2022040551

• Artificial Intelligence •

Threat intelligence entity relation extraction method integrating bootstrapping and semantic role labeling

Shunhang CHENG, Zhihua LI, Tao WEI

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2022-04-25 Revised: 2022-06-30 Accepted: 2022-07-01 Online: 2022-08-05 Published: 2023-05-10
  • Contact: Zhihua LI (jswxzhli@aliyun.com)
  • About author: CHENG Shunhang, born in 1998 in Jingmen, Hubei, M.S. candidate. His research interests include natural language processing and information security.
    LI Zhihua, born in 1969 in Baojing, Hunan, Ph.D., professor. His research interests include key technologies of end-edge-cloud computing and information security, and their interdisciplinary research with frontier disciplines such as artificial intelligence.
    WEI Tao, born in 1998 in Xiangyang, Hubei, M.S. candidate. His research interests include information system analysis and information system security.
  • Supported by:
    Intelligent Manufacturing Program of the Ministry of Industry and Information Technology (ZH-XZ-180004); Fundamental Research Funds for the Central Universities (JUSRP211A41)

Abstract:

To mine threat intelligence entities and their relations from open source heterogeneous big data efficiently and automatically, a Threat Intelligence Entity Relation Extraction (TIERE) method was proposed. Firstly, a data preprocessing method was developed by analyzing the characteristics of open source cyber security reports. Then, to address the high text complexity and the scarcity of standard datasets in the cyber security field, an Improved BootStrapping-based Named Entity Recognition (NER-IBS) algorithm and a Semantic Role Labeling-based Relation Extraction (RE-SRL) algorithm were proposed. Initial seeds were constructed from a small number of samples and rules, entities in unstructured text were mined through iterative training, and relations between entities were mined by a strategy of constructing semantic roles. Experimental results show that on a few-shot cyber security information extraction dataset, the F1 value of the NER-IBS algorithm is 84%, which is 2 percentage points higher than that of the RDF-CRF (Regular expression and Dictionary combined with Feature templates as well as Conditional Random Field) algorithm, and the F1 value of the RE-SRL algorithm for uncategorized relation extraction is 94%, indicating that the TIERE method has efficient entity and relation extraction capability.

Key words: entity recognition, relation extraction, threat intelligence, bootstrapping, semantic role labeling
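
The seed-and-iterate idea summarized in the abstract can be illustrated with a generic bootstrapping loop. The sketch below is not the NER-IBS algorithm itself, whose details are not given here; it is a minimal textbook-style example under stated assumptions: whitespace tokenization, a context pattern defined as the (previous word, next word) pair, and a hypothetical function name `bootstrap_entities`, toy corpus, and threshold `min_pattern_count`.

```python
from collections import Counter


def bootstrap_entities(sentences, seeds, iterations=2, min_pattern_count=2):
    """Grow an entity set from seeds by alternating pattern and entity discovery.

    Illustrative only: the pattern definition and threshold are assumptions,
    not the method described in the paper.
    """
    entities = set(seeds)
    tokenized = [s.split() for s in sentences]
    for _ in range(iterations):
        # Step 1: count (previous word, next word) contexts around known entities.
        patterns = Counter()
        for tokens in tokenized:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in entities:
                    patterns[(tokens[i - 1], tokens[i + 1])] += 1
        # Step 2: trust only contexts that were seen often enough.
        trusted = {p for p, c in patterns.items() if c >= min_pattern_count}
        # Step 3: every token that occurs in a trusted context is a candidate entity.
        found = set()
        for tokens in tokenized:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in trusted:
                    found.add(tokens[i])
        if found <= entities:
            break  # no new entities were discovered, so stop early
        entities |= found
    return entities


# Toy corpus (invented for illustration).
corpus = [
    "the malware Emotet uses phishing emails",
    "the malware TrickBot uses stolen credentials",
    "the malware Ryuk uses remote desktop access",
    "the group APT28 targets government networks",
]
result = bootstrap_entities(corpus, seeds={"Emotet", "TrickBot"})
print(sorted(result))
```

With two seeds sharing the context ("malware", "uses"), the loop picks up "Ryuk" from the third sentence and leaves "APT28" alone, since its context never co-occurs with a known entity. A practical system would need richer patterns and some control of semantic drift, which this sketch omits.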

CLC Number: