Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (11): 3574-3580. DOI: 10.11772/j.issn.1001-9081.2023111570
Received:
2023-11-15
Revised:
2024-03-01
Accepted:
2024-03-05
Online:
2024-03-12
Published:
2024-11-10
Contact:
Xuezhong XIAO
About author:
LIU Yusheng, born in 2001 in Xuzhou, Jiangsu, M. S. candidate. His research interests include computer vision and image composition.
Abstract:
To address the problems of mainstream image editing methods, such as narrow task coverage, unfriendly operation, and low fidelity, a diffusion-model-based method for high-fidelity image editing was proposed. The method takes the mainstream Stable Diffusion model as the backbone network. First, the model was fine-tuned with the Low-Rank Adaptation (LoRA) method so that it could reconstruct the original image more faithfully; then, the fine-tuned model performed inference on the image and a simple prompt through the designed framework to generate the edited image. On this basis, a dual-layer U-Net structure was further proposed for image editing tasks with specific requirements and for video synthesis. Comparative experiments with the leading methods Imagic, DiffEdit, and InstructPix2Pix on the TEdBench dataset show that the proposed method can perform various editing tasks on images, including non-rigid editing, with strong editability; moreover, its Learned Perceptual Image Patch Similarity (LPIPS) index is 30.38% lower than that of Imagic, indicating higher fidelity.
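The two-stage pipeline the abstract describes can be illustrated with a minimal sketch. This is not the authors' released implementation: it uses the Hugging Face diffusers and peft libraries as stand-ins, and the checkpoint ID, LoRA rank, target module names, file names, prompt, and strength value are all illustrative assumptions.

```python
# Minimal sketch of the two-stage pipeline: (1) fine-tune Stable Diffusion with
# LoRA so it can faithfully reconstruct the input image, then (2) run inference
# on the image plus a simple prompt to produce the edited result.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from peft import LoraConfig
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed backbone checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Stage 1 (assumed setup): attach low-rank adapters to the U-Net attention
# projections; only these LoRA weights would be optimized, on the single
# input image, while the backbone stays frozen (training loop elided).
lora_config = LoraConfig(
    r=4, lora_alpha=4,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)
# ... optimize the LoRA weights to reconstruct input.png ...

# Stage 2: denoise the partially noised input under a simple prompt;
# `strength` trades fidelity to the original against edit strength.
init_image = Image.open("input.png").convert("RGB")
edited = pipe(prompt="a sitting dog", image=init_image, strength=0.6).images[0]
edited.save("edited.png")
```

Keeping the backbone frozen and training only the per-image LoRA weights is what makes per-image fine-tuning cheap while preserving the prior needed for faithful reconstruction.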
CLC number:
Yusheng LIU, Xuezhong XIAO. High-fidelity image editing based on fine-tuning of diffusion model[J]. Journal of Computer Applications, 2024, 44(11): 3574-3580.
Tab. 1 Comparison of numbers of valid editing results

| Method | Non-rigid editing (60) | Object addition (21) | Subject replacement (11) | Background replacement (8) | Total (100 tasks) |
| --- | --- | --- | --- | --- | --- |
| Proposed method | 58 | 21 | 11 | 8 | 98 |
| Imagic | 56 | 20 | 11 | 8 | 95 |
| Img2Img | 46 | 13 | 3 | 5 | 67 |
| DiffEdit | 11 | 4 | 10 | 1 | 26 |
| InstructPix2Pix | 12 | 12 | 10 | 6 | 40 |
Tab. 2 Comparison of fidelity

| Method | CLIP Score | LPIPS |
| --- | --- | --- |
| Imagic | 25.1862 | 0.5507 |
| Img2Img | 23.8855 | 0.6345 |
| Proposed method | 25.3181 | 0.3834 |
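As a consistency check on the abstract's fidelity claim, the relative LPIPS reduction over Imagic follows directly from Tab. 2: (0.5507 - 0.3834) / 0.5507 ≈ 30.38%, where a lower LPIPS means the edited result stays perceptually closer to the original image; the proposed method also achieves the highest CLIP Score, where higher means better text-image alignment.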
References:
[1] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8748-8763.
[2] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595.
[3] BROOKS T, HOLYNSKI A, EFROS A A. InstructPix2Pix: learning to follow image editing instructions[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18392-18402.
[4] HERTZ A, MOKADY R, TENENBAUM J, et al. Prompt-to-Prompt image editing with cross-attention control[EB/OL]. [2023-09-12].
[5] COUAIRON G, VERBEEK J, SCHWENK H, et al. DiffEdit: diffusion-based semantic image editing with mask guidance[EB/OL]. [2023-08-22].
[6] KAWAR B, ZADA S, LANG O, et al. Imagic: text-based real image editing with diffusion models[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 6007-6017.
[7] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2. Cambridge: MIT Press, 2014: 2672-2680.
[8] NICHOL A, DHARIWAL P. Improved denoising diffusion probabilistic models[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8162-8171.
[9] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2020: 6840-6851.
[10] LIU H, WAN Z, HUANG W, et al. PD-GAN: probabilistic diverse GAN for image inpainting[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 9367-9376.
[11] JING Y, YANG Y, FENG Z, et al. Neural style transfer: a review[J]. IEEE Transactions on Visualization and Computer Graphics, 2020, 26(11): 3365-3385.
[12] ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2223-2232.
[13] PAN X, TEWARI A, LEIMKÜHLER T, et al. Drag your GAN: interactive point-based manipulation on the generative image manifold[C]// Proceedings of the 2023 ACM SIGGRAPH Conference. New York: ACM, 2023: No.78.
[14] ABDAL R, ZHU P, MITRA N J, et al. StyleFlow: attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows[J]. ACM Transactions on Graphics, 2021, 40(3): No.21.
[15] PATASHNIK O, WU Z, SHECHTMAN E, et al. StyleCLIP: text-driven manipulation of StyleGAN imagery[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 2065-2074.
[16] GAL R, PATASHNIK O, MARON H, et al. StyleGAN-NADA: CLIP-guided domain adaptation of image generators[J]. ACM Transactions on Graphics, 2022, 41(4): 1-13.
[17] XIA W, YANG Y, XUE J H, et al. TediGAN: text-guided diverse face image generation and manipulation[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 2256-2265.
[18] ABDAL R, ZHU P, FEMIANI J, et al. CLIP2StyleGAN: unsupervised extraction of StyleGAN edit directions[C]// Proceedings of the 2022 ACM SIGGRAPH Conference. New York: ACM, 2022: No.48.
[19] CROWSON K, BIDERMAN S, KORNIS D, et al. VQGAN-CLIP: open domain image generation and editing with natural language guidance[C]// Proceedings of the 2022 European Conference on Computer Vision, LNCS 13697. Cham: Springer, 2022: 88-105.
[20] ESSER P, ROMBACH R, OMMER B. Taming Transformers for high-resolution image synthesis[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 12868-12878.
[21] MOKADY R, TOV O, YAROM M, et al. Self-distilled StyleGAN: towards generation from internet photos[C]// Proceedings of the 2022 ACM SIGGRAPH Conference. New York: ACM, 2022: No.50.
[22] GONG S, LI M, FENG J, et al. DiffuSeq: sequence to sequence text generation with diffusion models[EB/OL]. [2023-10-12].
[23] RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8821-8831.
[24] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10674-10685.
[25] ZHANG L, RAO A, AGRAWALA M. Adding conditional control to text-to-image diffusion models[C]// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 3813-3824.
[26] LUGMAYR A, DANELLJAN M, ROMERO A, et al. RePaint: inpainting using denoising diffusion probabilistic models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 11451-11461.
[27] SAHARIA C, CHAN W, CHANG H, et al. Palette: image-to-image diffusion models[C]// Proceedings of the 2022 ACM SIGGRAPH Conference. New York: ACM, 2022: No.15.
[28] RUIZ N, LI Y, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 22500-22510.
[29] SUN Z, ZHOU Y, HE H, et al. SGDiff: a style guided diffusion model for fashion synthesis[C]// Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023: 8433-8442.
[30] MENG C, HE Y, SONG Y, et al. SDEdit: guided image synthesis and editing with stochastic differential equations[EB/OL]. [2023-08-05].
[31] KIM G, KWON T, YE J C. DiffusionCLIP: text-guided diffusion models for robust image manipulation[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 2416-2425.
[32] HOU C, WEI G, CHEN Z. High-fidelity diffusion-based image editing[C]// Proceedings of the 38th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2024: 2184-2192.
[33] VALEVSKI D, KALMAN M, MOLAD E, et al. UniTune: text-driven image editing by fine tuning a diffusion model on a single image[J]. ACM Transactions on Graphics, 2023, 42(4): No.128.
[34] AVRAHAMI O, LISCHINSKI D, FRIED O. Blended diffusion for text-driven editing of natural images[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18187-18197.
[35] HU E, SHEN Y, WALLIS P, et al. LoRA: low-rank adaptation of large language models[EB/OL]. [2023-06-22].
[36] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention, LNCS 9351. Cham: Springer, 2015: 234-241.
[37] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[38] SONG J, MENG C, ERMON S. Denoising diffusion implicit models[EB/OL]. [2023-11-25].
[39] CHAKRAVARTHI A, GURURAJA H S. Classifier-free guidance for Generative Adversarial Networks (GANs)[C]// Proceedings of the 2022 International Conference on Intelligent Computing and Communication, AISC 1447. Singapore: Springer, 2023: 217-232.
[40] CAO M, WANG X, QI Z, et al. MasaCtrl: tuning-free mutual self-attention control for consistent image synthesis and editing[C]// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 22503-22513.