Location control method for generated objects by diffusion model with exciting and pooling attention

doi:10.11772/j.issn.1001-9081.2023050634

Journal of Computer Applications, 2024, Vol. 44, Issue (4): 1093-1098. DOI: 10.11772/j.issn.1001-9081.2023050634

• Artificial Intelligence •

Jinsong XU, Ming ZHU, Zhiqiang LI, Shijie GUO

College of Computer and Information Engineering, Hubei University, Wuhan, Hubei 430062, China

Received: 2023-05-23 Revised: 2023-09-12 Accepted: 2023-09-28 Online: 2023-10-17 Published: 2024-04-10
Contact: Ming ZHU (zm@hubu.edu.cn)
About the authors:
XU Jinsong, born in 1999, from Xiangyang, Hubei, M. S. candidate. His research interests include image processing.
ZHU Ming, born in 1978, from Wuhan, Hubei, M. S., associate professor. His research interests include big data and artificial intelligence.
LI Zhiqiang, born in 1999, from Xianning, Hubei, M. S. candidate. His research interests include natural language processing.
GUO Shijie, born in 1999, from Nanyang, Henan, M. S. candidate. His research interests include natural language processing.
Supported by:
National Natural Science Foundation of China (62106069)

Abstract

Because text prompts are ambiguous and training data lack location information, current state-of-the-art diffusion models cannot accurately control where generated objects appear in an image when conditioned on text prompts alone. To address this issue, a spatial condition specifying each object's location range was introduced, and, based on the strong correlation between the cross-attention maps in U-Net and the spatial layout of the image, an attention-guidance method was proposed to control the generation of the attention maps and thereby the locations of the generated objects. Specifically, building on the Stable Diffusion (SD) model, a loss was introduced in the early stage of cross-attention map generation in the U-Net layers to excite high attention values inside the corresponding location range and to reduce the average attention value outside it. The noise vector in the latent space was then optimized step by step at each denoising step to control the generation of the attention maps. Experimental results show that the proposed method clearly controls the locations of one or more objects in the generated image and, when multiple objects are generated, reduces object omission, redundant object generation, and object fusion.

Key words: attention map, diffusion model, location control, text guidance, image generation
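The loss described in the abstract can be illustrated with a small sketch. The function below is a toy version under stated assumptions, not the paper's formula: the names `location_loss`, `make_map`, and `outside_weight`, the use of the maximum as the "high attention" measure, and the simple sum of the two terms are all hypothetical. It only shows the two ideas the abstract names: rewarding a high attention value inside the target location range and penalizing the average attention value outside it.

```python
def location_loss(attn, box, outside_weight=1.0):
    """Toy location loss for one object's cross-attention map.

    attn: 2-D list of attention values in [0, 1].
    box:  (top, left, bottom, right) target range, end-exclusive, non-empty.
    The form and weighting are illustrative assumptions only.
    """
    top, left, bottom, right = box
    inside_vals, outside_vals = [], []
    for r, row in enumerate(attn):
        for c, value in enumerate(row):
            if top <= r < bottom and left <= c < right:
                inside_vals.append(value)
            else:
                outside_vals.append(value)

    # Inside term: small when some position in the box has high attention.
    excite = 1.0 - max(inside_vals)
    # Outside term: small when the average attention outside the box is low.
    outside = sum(outside_vals) / len(outside_vals) if outside_vals else 0.0
    return excite + outside_weight * outside


def make_map(size, peak, value=0.9):
    """Build a size x size map that is zero except for one peak (hypothetical helper)."""
    grid = [[0.0] * size for _ in range(size)]
    grid[peak[0]][peak[1]] = value
    return grid


# Target range: the top-left 2x2 corner of a 4x4 attention map.
box = (0, 0, 2, 2)
loss_inside = location_loss(make_map(4, (1, 1)), box)     # peak inside the box
loss_misplaced = location_loss(make_map(4, (3, 3)), box)  # peak outside the box
```

The map whose peak lies inside the box receives the lower loss. In the method described above, the gradient of such a loss with respect to the latent noise vector would be used to update that vector at each early denoising step; the sketch omits that step because it requires the diffusion model itself.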

CLC Number:

TP391.4

Cite this article:

Jinsong XU, Ming ZHU, Zhiqiang LI, Shijie GUO. Location control method for generated objects by diffusion model with exciting and pooling attention[J]. Journal of Computer Applications, 2024, 44(4): 1093-1098.

Figures/Tables: 7

Fig. 1 Average attention maps generated by diffusion model in text-guided image generation task

Fig. 2 Optimization process in single denoising step

Fig. 3 Generated images and attention maps using text prompt “a cat swimming in water” as condition for guidance

Fig. 4 Generated images and attention maps using text prompt “a bird flying in the sky” as condition for guidance

Fig. 5 Generated images and attention maps using text prompt “a red sphere and a blue cube” as condition for guidance

Fig. 6 Generated images and attention maps using text prompt “a cat and a dog” as condition for guidance

Fig. 7 Generated images and attention maps using text prompt “a dog watching a bird” as condition for guidance

