Backdoor attack method for visual language models based on block perturbation

doi:10.11772/j.issn.1001-9081.2025101281

Journal of Computer Applications

Received:2025-11-03 Revised:2026-01-06 Accepted:2026-01-08 Online:2026-03-25 Published:2026-03-25

基于分块扰动的视觉语言模型后门攻击方法

黄赞¹，袁得嵛^1,2*，杨懿¹，苗博¹

1.中国人民公安大学信息网络安全学院，北京 100038；
2.安全防范与风险评估公安部重点实验室(中国人民公安大学)，北京 102623

通讯作者: 袁得嵛
基金资助:
公安部技术研究计划重点项目

Abstract

Abstract: With the advancement of multimodal technology, the security issues of Visual Language Models (VLMs) have garnered significant attention. Current backdoor attack methods against VLMs suffer from low stealthiness, monotonous patterns, poor attack effectiveness, and high required costs. To address these issues, a patch-adaptive perturbation-based backdoor attack method was proposed for the first time from the perspective of the model's visual encoder architecture. Firstly, the attack target identifier was defined as a character string and encoded into an invisible perturbation signal using steganographic techniques. Secondly, building upon the patch encoding concept of the Vision Transformer (ViT) architecture, the perturbation was adaptively embedded into various image patches, making the poisoned images carrying triggers visually indistinguishable from clean ones. Finally, by poisoning the data during the instruction fine-tuning stage, the multimodal feature fusion process of the model was manipulated to hijack its cross-modal alignment, ultimately achieving the attack effect of inducing the model to output malicious text. Experimental results demonstrate that the proposed method achieves an attack success rate of 96% under a low poisoning rate not exceeding 5%, with a detection success rate remaining below 15%. Its stealthiness surpasses that of various advanced methods, and it demonstrates strong transferability across different VLMs.

Key words: Vision Transformer(ViT), Large Language Model(LLM), backdoor attacks, data poisoning, instruction fine-tuning

摘要： 随着多模态技术的发展，视觉语言模型(VLMs)的安全问题引人注目。针对当前视觉语言模型的后门攻击方法存在隐蔽性低、模式单一、攻击效果差和所需成本高的问题，本文研究从模型视觉编码器架构的角度，提出一种基于分块自适应扰动的后门攻击方法。首先，将攻击目标标识为字符串，并运用隐写技术将它编码为不可见的扰动信号；其次，基于视觉transformer(ViT)架构的分块编码理念，将扰动自适应地嵌入到各个图像分块中，使得携带触发器的中毒图像在视觉上与干净图像难以区分；最后，通过在指令微调阶段的数据投毒操控模型的多模态特征融合过程，进而劫持它的跨模态对齐，实现模型输出恶意文本的攻击效果。实验结果表明，所提方法在不超过5%的低中毒率下，攻击成功率达到96%，检测成功率低于15%，且对不同VLM表现出泛化性。

关键词: 视觉Transformer, 大语言模型, 后门攻击, 数据投毒, 指令微调

CLC Number:

TP309
TP18

黄赞袁得嵛杨懿苗博. 基于分块扰动的视觉语言模型后门攻击方法[J]. 《计算机应用》唯一官方网站, DOI: 10.11772/j.issn.1001-9081.2025101281.