About the authors: XU Jinsong, born in 1999, M. S. candidate. His research interests include image processing. ZHU Ming, born in 1978, M. S., associate professor. His research interests include big data and artificial intelligence. LI Zhiqiang, born in 1999, M. S. candidate. His research interests include natural language processing. GUO Shijie, born in 1999, M. S. candidate. His research interests include natural language processing.
Supported by: National Natural Science Foundation of China (62106069)
Jinsong XU, Ming ZHU, Zhiqiang LI, Shijie GUO. Location control method for generated objects by diffusion model with exciting and pooling attention[J]. Journal of Computer Applications, 2024, 44(4): 1093-1098.
1 SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[J]. Advances in Neural Information Processing Systems, 2022, 35: 36479-36494.
2 SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 4713-4726.
3 NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[EB/OL]. (2022-03-08) [2023-05-10].
4 ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10684-10695. 10.1109/cvpr52688.2022.01042
5 HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851. 10.48550/arXiv.2006.11239
6 DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794. 10.48550/arXiv.2105.05233
7 ZHENG G, LI S, WANG H, et al. Entropy-driven sampling and training scheme for conditional diffusion generation[C]// Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 754-769. 10.1007/978-3-031-20047-2_43
8 ZHANG C, ZHANG C, ZHANG M, et al. Text-to-image diffusion model in generative AI: a survey[EB/OL]. (2023-03-14) [2023-04-02].
9 KAWAR B, ELAD M, ERMON S, et al. Denoising diffusion restoration models[J]. Advances in Neural Information Processing Systems, 2022, 35: 23593-23606.
10 LUGMAYR A, DANELLJAN M, ROMERO A, et al. RePaint: inpainting using denoising diffusion probabilistic models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 11461-11471. 10.1109/cvpr52688.2022.01117
11 MANSIMOV E, PARISOTTO E, BA J L, et al. Generating images from captions with attention[EB/OL]. (2016-02-29) [2023-05-10].
12 SCHUHMANN C, VENCU R, BEAUMONT R, et al. LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs[EB/OL]. (2021-11-03) [2023-05-10].
13 SOHL-DICKSTEIN J, WEISS E, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]// Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 2256-2265. 10.48550/arXiv.1503.03585
14 SONG J, MENG C, ERMON S. Denoising diffusion implicit models[EB/OL]. (2022-10-05) [2023-05-10].
15 SONG Y, ERMON S. Generative modeling by estimating gradients of the data distribution[EB/OL]. (2020-10-10) [2023-05-10].
16 HO J, SALIMANS T. Classifier-free diffusion guidance[EB/OL]. (2022-07-26) [2023-05-10].
17 LIU V, CHILTON L B. Design guidelines for prompt engineering text-to-image generative models[C]// Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. New York: ACM, 2022: 384. 10.1145/3491102.3501825
18 WITTEVEEN S, ANDREWS M. Investigating prompt engineering in diffusion models[EB/OL]. (2022-11-21) [2023-05-10].
19 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8748-8763.
20 HERTZ A, MOKADY R, TENENBAUM J, et al. Prompt-to-prompt image editing with cross attention control[EB/OL]. (2022-08-02) [2023-05-10].
21 CHEFER H, ALALUF Y, VINKER Y, et al. Attend-and-Excite: attention-based semantic guidance for text-to-image diffusion models[J]. ACM Transactions on Graphics, 2023, 42(4): 148.