融合卷积与多头注意力的人体姿态迁移模型

doi:10.11772/j.issn.1001-9081.2022111707

《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (11): 3403-3410.DOI: 10.11772/j.issn.1001-9081.2022111707

所属专题：人工智能

融合卷积与多头注意力的人体姿态迁移模型

杨红(), 张贺, 靳少宁

大连海事大学信息科学技术学院，辽宁大连 116026

收稿日期:2022-11-18 修回日期:2022-12-25 接受日期:2022-12-28 发布日期:2023-11-14 出版日期:2023-11-10
通讯作者: 杨红
作者简介:杨红（1977—），女，辽宁葫芦岛人，副教授，博士，主要研究方向：数据挖掘、行为识别 yanghong@dlmu.edu.cn
张贺（1998—），男，山东临沂人，硕士研究生，主要研究方向：图像生成、深度生成模型
靳少宁（1996—），女，甘肃静宁人，硕士研究生，主要研究方向：步态识别、人工智能。

Human pose transfer model combining convolution and multi-head attention

Hong YANG(), He ZHANG, Shaoning JIN

Information Science and Technology College，Dalian Maritime University，Dalian Liaoning 116026，China

Received:2022-11-18 Revised:2022-12-25 Accepted:2022-12-28 Online:2023-11-14 Published:2023-11-10
Contact: Hong YANG
About author:YANG Hong， born in 1977， Ph. D.， associate professor. Her research interests include data mining， behavior recognition.
ZHANG He， born in 1998， M. S. candidate. His research interests include image generation， deep generative models.
JIN Shaoning， born in 1996， M. S. candidate. Her research interests include gait recognition， artificial intelligence.

摘要/Abstract

摘要：

对于给定某个人物的参考图像，人体姿态迁移（HPT）的目标是生成任意姿态下的该人物图像。许多现有的相关方法在捕捉人物外观细节、推测不可见区域方面仍存在不足，特别是对于复杂的姿态变换，难以生成清晰逼真的人物外观。为了解决以上问题，提出一种新颖的融合卷积与多头注意力的HPT模型。首先，融合卷积与多头注意力机制构建卷积-多头注意力（Conv-MHA）模块，提取丰富的上下文特征；其次，利用Conv-MHA模块构建HPT网络，提升所提模型的学习能力；最后，引入参考图像的自我重建作为辅助任务，更充分地发挥所提模型的性能。在DeepFashion和Market-1501数据集上验证了基于Conv-MHA的HPT模型，结果显示：它在DeepFashion测试集上的结构相似性（SSIM）、感知相似度（LPIPS）和FID（Fréchet Inception Distance）指标均优于现有的HPT模型DPTN （Dual-task Pose Transformer Network）。实验结果表明，融合卷积与多头注意力机制的Conv-MHA模块可以提升模型的表示能力，更加有效地捕捉人物外观细节，提升人物图像生成的精度。

关键词: 人体姿态迁移, 图像生成, 生成对抗网络, 多头注意力, 卷积

Abstract:

For a given reference image of a person， the goal of Human Pose Transfer （HPT） is to generate an image of that person in any arbitrary pose. Many existing related methods fail to capture the details of a person’s appearance and have difficulties in predicting invisible regions， especially for complex pose transformation， and it is difficult to generate a clear and realistic person’s appearance. To address the above problems， a new HPT model that integrated convolution and multi-head attention was proposed. Firstly， the Convolution-Multi-Head Attention （Conv-MHA） block was constructed by fusing the convolution and multi-head attention， then it was used to extract rich contextual features. Secondly， to improve the learning ability of the proposed model， the HPT network was constructed by using Conv-MHA block. Finally， the self-reconstruction of the reference image was introduced as an auxiliary task to make the model more fully utilized its performance. The Conv-MHA-based human pose transfer model was validated on DeepFashion and Market-1501 datasets， and the results on DeepFashion test dataset show that it outperforms the state-of-the-art human pose transfer model， DPTN （Dual-task Pose Transformer Network）， in terms of Structural SIMilarity （SSIM）， Learned Perceptual Image Patch Similarity （LPIPS） and FID （Fréchet Inception Distance） indicators. Experimental results show that the Conv-MHA module， which integrates convolution and multi-head attention mechanism， can improve the representation ability of the model， capture the details of person’s appearance more effectively， and improve the accuracy of person image generation.

Key words: Human Pose Transfer (HPT), image generation, generative adversarial network, multi-head attention, convolution

中图分类号:

TP183

杨红, 张贺, 靳少宁. 融合卷积与多头注意力的人体姿态迁移模型[J]. 计算机应用, 2023, 43(11): 3403-3410.

Hong YANG, He ZHANG, Shaoning JIN. Human pose transfer model combining convolution and multi-head attention[J]. Journal of Computer Applications, 2023, 43(11): 3403-3410.

图/表 9

参考文献 38

1	GOODFELLOW I J， POUGET-ABADIE J， MIRZA M， et al. Generative adversarial nets［C］// Proceedings of the 27th International Conference on Neural Information Processing Systems — Volume 2. Cambridge： MIT Press， 2014： 2672-2680.
2	KINGMA D P， WELLING M. Auto-encoding variational Bayes［EB/OL］. （2022-12-10）［2023-03-17］.. 10.1561/2200000056
3	MA L， JIA X， SUN Q， et al. Pose guided person image generation［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017： 405-415.
4	ESSER P， SUTTER E. A variational U-Net for conditional appearance and shape generation［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 8857-8866. 10.1109/cvpr.2018.00923
5	LI Y， HUANG C， LOY C C. Dense intrinsic appearance flow for human pose transfer［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 3688-3697. 10.1109/cvpr.2019.00381
6	REN Y， YU X， CHEN J， et al. Deep image spatial transformation for person image generation［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 7687-7696. 10.1109/cvpr42600.2020.00771
7	LV Z， LI X， LI X， et al. Learning semantic person image generation by region-adaptive normalization［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 10801-10810. 10.1109/cvpr46437.2021.01066
8	ZHANG J， LI K， LAI Y K， et al. PISE： person image synthesis and editing with decoupled GAN［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 7978-7986. 10.1109/cvpr46437.2021.00789
9	TANG H， BAI S， ZHANG L， et al. XingGAN for person image generation［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12370. Cham： Springer， 2020： 717-734.
10	ZHU Z， HUANG T， SHI B， et al. Progressive pose attention transfer for person image generation［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 2342-2351. 10.1109/cvpr.2019.00245
11	ZHANG P， YANG L， LAI J， et al. Exploring dual-task correlation for pose guided person image generation［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 7703-7712. 10.1109/cvpr52688.2022.00756
12	VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017： 6000-6010.
13	HU J， SHEN L， SUN G. Squeeze-and-excitation networks［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 7132-7141. 10.1109/cvpr.2018.00745
14	LI X， WANG W， HU X， et al. Selective kernel networks［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 510-519. 10.1109/cvpr.2019.00060
15	WOO S， PARK J， LEE J Y， et al. CBAM： convolutional block attention module［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11211. Cham： Springer， 2018： 3-19.
16	SRINIVAS A， LIN T Y， PARMAR N， et al. Bottleneck Transformers for visual recognition［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 16514-16524. 10.1109/cvpr46437.2021.01625
17	DOSOVITSKIY A， BEYER L， KOLESNIKOV A， et al. An image is worth 16×16 words： Transformers for image recognition at scale［EB/OL］. （2021-06-03）［2022-06-17］..
18	LIU Z， LIN Y， CAO Y， et al. Swin Transformer： hierarchical vision Tansformer using shifted windows［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 9992-10002. 10.1109/iccv48922.2021.00986
19	DONG X， BAO J， CHEN D， et al. CSWin Transformer： a general vision Transformer backbone with cross-shaped windows［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 12114-12124. 10.1109/cvpr52688.2022.01181
20	VASWANI A， RAMACHANDRAN P， SRINIVAS A， et al. Scaling local self-attention for parameter efficient visual backbones［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 12889-12899. 10.1109/cvpr46437.2021.01270
21	LI Y， YAO T， PAN Y， et al. Contextual Transformer networks for visual recognition［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2004， 45（2）： 1489-1500.
22	DAI Z， LIU H， LE Q V， et al. CoAtNet： marrying convolution and attention for all data sizes［C］// Proceedings of the 35th Conference on Neural Information Processing Systems （2021）［2022-06-17］..
23	PAN X， GE C， LU R， et al. On the integration of self-attention and convolution［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 805-815. 10.1109/cvpr52688.2022.00089
24	RONNEBERGER O， FISCHER P， BROX T. U-net： convolutional networks for biomedical image segmentation［C］// Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention， LNCS 9351. Cham： Springer， 2015： 234-241.
25	JIANG Y， CHANG S， WANG Z. TransGAN： two pure transformers can make one strong GAN， and that can scale up［C］// Proceedings of the 35th Conference on Neural Information Processing Systems （2021）［2022-06-17］..
26	HUDSON D A， ZITNICK C L. Generative adversarial transformers［C］// Proceedings of the 38th International Conference on Machine Learning. New York： JMLR.org， 2021： 4487-4499.
27	JOHNSON J， ALAHI A， LI F F. Perceptual losses for real-time style transfer and super-resolution［C］// Proceedings of the 2016 European Conference on Computer Vision， LNCS 9906. Cham： Springer， 2016： 694-711.
28	SIMONYAN K， ZISSERMAN A. Very deep convolutional networks for large-scale image recognition［EB/OL］. （2015-04-10）［2022-06-17］..
29	ISOLA P， ZHU J Y， ZHOU T， et al. Image-to-image translation with conditional adversarial networks［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 5967-5976. 10.1109/cvpr.2017.632
30	GULRAJANI I， AHMED F， ARJOVSKY M， et al. Improved training of Wasserstein GANs［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017： 5769-5779.
31	LIU Z， LUO P， QIU S， et al. DeepFashion： powering robust clothes recognition and retrieval with rich annotations［C］// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2016： 1096-1104. 10.1109/cvpr.2016.124
32	ZHENG L， SHEN L， TIAN L， et al. Scalable person re-identification： a benchmark［C］// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2015： 1116-1124. 10.1109/iccv.2015.133
33	CAO Z， SIMON T， WEI S E， et al. Realtime multi-person 2D pose estimation using part affinity fields［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 1302-1310. 10.1109/cvpr.2017.143
34	WANG Z， BOVIK A C， SHEIKH H R， et al. Image quality assessment： from error visibility to structural similarity［J］. IEEE Transactions on Image Processing， 2004， 13（4）： 600-612. 10.1109/tip.2003.819861
35	HEUSEL M， RAMSAUER H， UNTERTHINER T， et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017： 6629-6640. 10.48550/arXiv.1706.08500
36	ZHANG R， ISOLA P， EFROS A A， et al. The unreasonable effectiveness of deep features as a perceptual metric［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 586-595. 10.1109/cvpr.2018.00068
37	KINGMA D P， BA J L. Adam： a method for stochastic optimization［EB/OL］. （2017-01-30）［2022-06-17］..
38	MEN Y， MAO Y， JIANG Y， et al. Controllable person image synthesis with attribute-decomposed GAN［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 5083-5092. 10.1109/cvpr42600.2020.00513

模块	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
CoT	0.772 1	18.644 3	13.339 9	0.213 3
Transformer	0.776 4	19.007 8	11.332 7	0.197 2
方案（a）	0.778 7	19.053 7	11.448 7	0.196 4
方案（b）	0.779 0	19.057 4	11.360 9	0.195 4
方案（c）	0.779 2	19.076 5	11.203 1	0.193 7
方案（d）	0.779 8	19.089 3	11.135 9	0.193 6

模块	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
CoT	0.772 1	18.644 3	13.339 9	0.213 3
Transformer	0.776 4	19.007 8	11.332 7	0.197 2
方案（a）	0.778 7	19.053 7	11.448 7	0.196 4
方案（b）	0.779 0	19.057 4	11.360 9	0.195 4
方案（c）	0.779 2	19.076 5	11.203 1	0.193 7
方案（d）	0.779 8	19.089 3	11.135 9	0.193 6

注意力头数量	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
1	0.776 2	18.922 1	11.869 5	0.202 0
2	0.778 1	18.999 9	11.470 0	0.196 1
4	0.778 6	19.035 0	11.386 5	0.195 0
8	0.779 8	19.089 3	11.135 9	0.193 6
16	0.779 5	19.078 0	11.018 4	0.193 8

注意力头数量	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
1	0.776 2	18.922 1	11.869 5	0.202 0
2	0.778 1	18.999 9	11.470 0	0.196 1
4	0.778 6	19.035 0	11.386 5	0.195 0
8	0.779 8	19.089 3	11.135 9	0.193 6
16	0.779 5	19.078 0	11.018 4	0.193 8

模型	DeepFashion				Market-1501
模型	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
PG2	0.773 0	17.532 4	49.567 4	0.292 8	0.270 4	14.174 9	86.028 8	0.361 9
PATN	0.771 7	18.254 3	20.750 0	0.253 6	0.281 8	14.262 2	22.681 4	0.319 4
ADGAN	0.771 9	18.376 8	14.483 3	0.225 6	—	—	—	—
DIST	0.767 7	18.573 7	10.842 9	0.225 8	0.280 8	14.336 8	19.740 3	0.281 5
PISE	0.768 2	18.520 8	11.514 4	0.208 0	—	—	—	—
SPIG	0.775 8	18.586 7	12.702 7	0.210 2	0.313 9	14.489 4	23.057 3	0.277 7
DPTN	0.778 2	19.149 2	11.466 4	0.195 7	0.285 4	14.520 7	18.994 6	0.271 1
本文模型	0.779 8	19.089 3	11.135 9	0.193 6	0.287 7	14.572 7	24.690 7	0.275 8

融合卷积与多头注意力的人体姿态迁移模型

Human pose transfer model combining convolution and multi-head attention

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 9

参考文献 38

相关文章 15

编辑推荐

Metrics

[1]	薛桂香, 王辉, 周卫峰, 刘瑜, 李岩. 基于知识图谱和时空扩散图卷积网络的港口交通流量预测[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2952-2957.
[2]	秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974.
[3]	庞川林, 唐睿, 张睿智, 刘川, 刘佳, 岳士博. D2D通信系统中基于图卷积网络的分布式功率控制算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2855-2862.
[4]	李云, 王富铕, 井佩光, 王粟, 肖澳. 基于不确定度感知的帧关联短视频事件检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2903-2910.
[5]	赵志强, 马培红, 黑新宏. 基于双重注意力机制的人群计数方法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2886-2892.
[6]	陈虹, 齐兵, 金海波, 武聪, 张立昂. 融合1D-CNN与BiGRU的类不平衡流量异常检测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2493-2499.
[7]	赵宇博, 张丽萍, 闫盛, 侯敏, 高茂. 基于改进分段卷积神经网络和知识蒸馏的学科知识实体间关系抽取[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2421-2429.
[8]	张春雪, 仇丽青, 孙承爱, 荆彩霞. 基于两阶段动态兴趣识别的购买行为预测模型[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2365-2371.
[9]	高鹏淇, 黄鹤鸣, 樊永红. 融合坐标与多头注意力机制的交互语音情感识别[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2400-2406.
[10]	汪才钦, 周渝皓, 张顺香, 王琰慧, 王小龙. 基于语境增强的新能源汽车投诉文本方面-观点对抽取[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2430-2436.
[11]	刘禹含, 吉根林, 张红苹. 基于骨架图与混合注意力的视频行人异常检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2551-2557.
[12]	王东炜, 刘柏辰, 韩志, 王艳美, 唐延东. 基于低秩分解和向量量化的深度网络压缩方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 1987-1994.
[13]	李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072.
[14]	高阳峄, 雷涛, 杜晓刚, 李岁永, 王营博, 闵重丹. 基于像素距离图和四维动态卷积网络的密集人群计数与定位方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2233-2242.
[15]	刘丽, 侯海金, 王安红, 张涛. 基于多尺度注意力的生成式信息隐藏算法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2102-2109.