Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (7): 2288-2295. DOI: 10.11772/j.issn.1001-9081.2024070918
Lixiao ZHANG, Yao MA, Yuli YANG, Dan YU, Yongle CHEN
Received: 2024-07-03
Revised: 2024-10-20
Accepted: 2024-10-21
Online: 2025-07-10
Published: 2025-07-10
Contact: Yongle CHEN
About author: ZHANG Lixiao, born in 1999 in Lüliang, Shanxi, M.S. candidate, CCF member. His research interests include Internet of Things security.
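Abstract:
Internet of Things (IoT) device vendors commonly reuse large numbers of open-source components, compiled from open-source code, during firmware development, and a single firmware image typically consists of hundreds of such components. If these components are not updated in time, open-source components that lack security patches may be integrated into firmware together with their vulnerabilities, creating security risks for IoT devices. Identifying the binary components in IoT firmware is therefore essential to ensuring the security of IoT devices. To address the difficulty that existing methods have in identifying binary components at scale, a large-scale IoT binary component identification method based on Named Entity Recognition (NER) is proposed. First, the binary components inside a firmware image are extracted by unpacking the firmware. Then, semantic information about each component is obtained in two ways: readable string extraction and component execution. Finally, a RoBERTa-BiLSTM-CRF NER model is used to identify component names and version numbers. Experimental results on 6,575 firmware images released by 12 popular IoT vendors show that the proposed method achieves an F1 score of 87.67% and successfully identifies 163 binary components. The method thus effectively expands the range of binary components that can be identified in IoT firmware and helps secure firmware from a software supply chain perspective.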
Lixiao ZHANG, Yao MA, Yuli YANG, Dan YU, Yongle CHEN. Large-scale IoT binary component identification based on named entity recognition[J]. Journal of Computer Applications, 2025, 45(7): 2288-2295.
| Semantic information category | Subcategory | Example |
|---|---|---|
| Component version semantic information | | BusyBox v1.1.2 (2018.10.19-17:25+0000) |
| Non-component-version semantic information | Dependency library information | libcrypto.so.1.0.0 |
| | | GLIBC_2.0 |
| | | 2.1.117 |
| | | v5.4 |
| | | udhcp 0.9.9-pre |
| | | … |
| | Format strings | %-8.8s |
| | | %4.1f%% |
| | | %u blocks (%2.2f%%) reserved for the super user |
| | | %-9.9s Link encap:%s |
| | | GET %stp://[%s]:%d/%s HTTP/1.1 |
| | | … |
| | Other information | /opt_2/src/RTL_360_F5S/user/busybox/busybox-1.1.2/e2fsprogs/e2fsck.c |
| | | $Id: vi.c,v 1.38 2004/08/19 19:15:06 andersen Exp $ |
| | | This is not GNU sed version 4.0 |
| | | some 2.4 kernels do not support blocksizes greater than 4096 using ext3. |
| | | … |
Tab. 1 Examples of readable strings
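The readable strings in Tab. 1 are the kind of output produced by scanning a binary for runs of printable characters. The following is a minimal sketch of that step, assuming printable ASCII and a minimum run length of 4 (both are illustrative choices similar to the Unix `strings` utility, not settings taken from the paper); the function name `extract_readable_strings` is hypothetical.

```python
import re
from typing import List


def extract_readable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable ASCII bytes found in `data`.

    Assumption: printable ASCII (0x20-0x7e) only; the paper does not state
    which character set or minimum length it uses.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# A tiny synthetic byte blob standing in for a component binary.
sample = (
    b"\x7fELF\x01\x00"
    b"BusyBox v1.1.2 (2018.10.19-17:25+0000)\x00\x02"
    b"libcrypto.so.1.0.0\x00hi\x00"
)
strings_found = extract_readable_strings(sample)
print(strings_found)
```

Most strings recovered this way are noise (format strings, paths, dependency names), which is why a later recognition step is needed to pick out the component name and version.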
| Semantic information category | Subcategory | Example |
|---|---|---|
| Component version semantic information | | iptables v1.4.21 |
| Non-component-version semantic information | Prompt message | iptables v1.4.21: no command specified |
| | | Try 'iptables -h' or 'iptables --help' for more information. |
| | Basic command usage | Usage: iptables -[ACD] chain rule-specification [options] |
| | | iptables -I chain [rulenum] rule-specification [options] |
| | | iptables -R chain rulenum rule-specification [options] |
| | | iptables -D chain rulenum [options] |
| | | … |
| | Detailed command description | Commands: |
| | | Either long or short options are allowed. |
| | | --append -A chain Append to chain |
| | | --check -C chain Check for the existence of a rule |
| | | --delete -D chain Delete matching rule from chain |
| | | … |
| | Detailed description of available options | Options: |
| | | --ipv4 -4 Nothing (line is ignored by ip6tables-restore) |
| | | --ipv6 -6 Error (line is ignored by iptables-restore) |
| | | [!] --protocol -p proto protocol: by number or name, eg. 'tcp' |
| | | … |
Tab. 2 Examples of semantic output from component execution
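Output such as that in Tab. 2 can be collected by running an extracted component under an emulator and capturing what it prints. The sketch below assumes QEMU user-mode emulation, where `-L` points the emulator at the unpacked root filesystem so the target's loader and libraries can be found. The emulator name, the list of flags to try, and the timeout are illustrative assumptions, and the helper names are hypothetical; the paper does not specify how it invokes QEMU.

```python
import subprocess
from typing import List, Optional, Sequence


def build_exec_command(emulator: str, rootfs: str, binary: str, flag: str) -> List[str]:
    """Build a QEMU user-mode command line for one component and one flag."""
    # -L sets the prefix used to locate the target's ELF interpreter and
    # shared libraries (here: the unpacked firmware root filesystem).
    return [emulator, "-L", rootfs, binary, flag]


def run_component(
    emulator: str,
    rootfs: str,
    binary: str,
    flags: Sequence[str] = ("--version", "-v", "--help"),  # assumed flags
    timeout: float = 5.0,  # assumed timeout in seconds
) -> Optional[str]:
    """Try each flag in turn and return the first non-empty output, else None."""
    for flag in flags:
        cmd = build_exec_command(emulator, rootfs, binary, flag)
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # version banners often go to stderr
                timeout=timeout,
                check=False,  # a non-zero exit code can still carry usable text
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # emulator missing, binary not runnable, or hung
        text = proc.stdout.decode("utf-8", errors="replace").strip()
        if text:
            return text
    return None


# Example call (paths are placeholders):
# run_component("qemu-arm-static", "/tmp/rootfs", "/tmp/rootfs/sbin/iptables")
```

Because firmware binaries are untrusted, a setup like this would normally run inside an isolated sandbox or container.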
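| Entity type | Entity label | Label description |
|---|---|---|
| Component name (组件名) | B-组件名 | Start position of a component name |
| | I-组件名 | Middle or end position of a component name |
| Version number (版本号) | B-版本号 | Start position of a version number |
| | I-版本号 | Middle or end position of a version number |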
Tab. 3 Entity label description
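The B-/I- labels in Tab. 3 follow the usual BIO tagging convention. The sketch below shows how entity spans map to such labels and back. It assumes character-level tokens, an `O` label for everything outside an entity, and the short English stand-ins `COMP` and `VER` for the two entity types; none of these details are stated in the source, and the function names are hypothetical.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO label per token."""
    labels = ["O"] * n_tokens  # "O" (outside) is an assumed third label
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


def bio_to_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- labels back into (entity type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                entities.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, "".join(current_tokens)))
    return entities


text = "iptables v1.4.21"  # version string taken from Tab. 2
tokens = list(text)  # character-level tokens (an assumption)
labels = spans_to_bio(len(tokens), [(0, 8, "COMP"), (9, 16, "VER")])
print(labels)
print(bio_to_entities(tokens, labels))
```

In the full pipeline, a sequence-labeling model would predict the label list and a decoder like `bio_to_entities` would turn it into a component name and version number.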
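| Item | Configuration |
|---|---|
| Operating system | Ubuntu 18.04 |
| GPU | NVIDIA GeForce RTX 3080 (10 GB) |
| Programming language | Python 3.9 |
| Deep learning framework | PyTorch 2.1.0 |
| Web crawling framework | Scrapy 2.5 |
| Firmware unpacking | Binwalk 2.3.3 |
| Emulated execution | QEMU 6.1 |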
Tab. 4 Experimental environment parameters
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| max_length | 100 | dropout | 0.3 |
| LSTM_size | 128 | learning rate | 0.00001 |
| batch_size | 32 | | |
Tab. 5 Model parameters
| Component identification method | P/% | R/% | F1/% |
|---|---|---|---|
| Regular expression | 59.69 | 69.75 | 64.33 |
| BERT-BiGRU-CRF | 75.88 | 76.52 | 76.20 |
| BERT-BiLSTM-CRF | 77.31 | 81.32 | 79.26 |
| Proposed model | 89.53 | 85.89 | 87.67 |
Tab. 6 Comparative experimental results of component entity recognition (%)
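The F1 column in Tab. 6 is consistent with the standard definition of F1 as the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). The short check below recomputes it from the reported P and R values; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from Tab. 6.
reported = {
    "Regular expression": (59.69, 69.75, 64.33),
    "BERT-BiGRU-CRF": (75.88, 76.52, 76.20),
    "BERT-BiLSTM-CRF": (77.31, 81.32, 79.26),
    "Proposed model": (89.53, 85.89, 87.67),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: recomputed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```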
| RoBERTa | CRF | BiLSTM | P/% | R/% | F1/% |
|---|---|---|---|---|---|
| √ | | | 72.25 | 73.89 | 73.06 |
| √ | √ | | 75.16 | 77.62 | 76.37 |
| √ | √ | √ | 89.53 | 85.89 | 87.67 |
Tab. 7 Results of ablation experiments (%)
| Component identification method | Number of components identified |
|---|---|
| Regular expression | 63 |
| VES | 5 |
| FirmUp | 5 |
| FirmSEC | 92 |
| Proposed method | 163 |
Tab. 8 Comparison of the number of components identified
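To give a feel for why a regular-expression baseline is limited, the sketch below applies one hand-written "name followed by version" pattern to strings from Tab. 1. The pattern is an illustrative assumption and is not the baseline expression used in the experiments, which the source does not give. It matches the genuine BusyBox version string but also fires on an unrelated sentence.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a word-like name, whitespace, an optional "v",
# then a dotted version number such as 1.1.2.
NAME_VERSION = re.compile(
    r"(?P<name>[A-Za-z][\w\-+]*)\s+v?(?P<version>\d+(?:\.\d+)+)"
)


def regex_identify(line: str) -> Optional[Tuple[str, str]]:
    """Return (name, version) for the first match in `line`, or None."""
    match = NAME_VERSION.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("version")


# Strings copied from Tab. 1.
true_version_line = "BusyBox v1.1.2 (2018.10.19-17:25+0000)"
unrelated_line = "some 2.4 kernels do not support blocksizes greater than 4096 using ext3."
format_string = "%u blocks (%2.2f%%) reserved for the super user"

print(regex_identify(true_version_line))
print(regex_identify(unrelated_line))  # a false positive
print(regex_identify(format_string))
```

Tightening the pattern to reject the false positive tends to exclude valid components with unusual naming, which is the trade-off a learned sequence labeler is meant to avoid.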