Human pose transfer model combining convolution and multi-head attention

doi:10.11772/j.issn.1001-9081.2022111707

Abstract

Given a reference image of a person, the goal of Human Pose Transfer (HPT) is to generate an image of that person in an arbitrary pose. Many existing methods fail to capture the details of a person's appearance and have difficulty predicting invisible regions; for complex pose transformations in particular, they struggle to generate a clear and realistic appearance. To address these problems, a new HPT model that integrates convolution and multi-head attention was proposed. Firstly, a Convolution-Multi-Head Attention (Conv-MHA) block was constructed by fusing convolution with multi-head attention and was used to extract rich contextual features. Secondly, to improve the learning ability of the proposed model, the HPT network was built from Conv-MHA blocks. Finally, self-reconstruction of the reference image was introduced as an auxiliary task so that the model could exploit its capacity more fully. The Conv-MHA-based HPT model was validated on the DeepFashion and Market-1501 datasets. On the DeepFashion test set, it outperforms the state-of-the-art HPT model DPTN (Dual-task Pose Transformer Network) in terms of Structural SIMilarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID). Experimental results show that the Conv-MHA block, which integrates convolution with the multi-head attention mechanism, can improve the representation ability of the model, capture the details of a person's appearance more effectively, and improve the accuracy of person image generation.
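
The abstract does not say how convolution and multi-head attention are fused inside the Conv-MHA block, so the following is only a minimal NumPy sketch of one plausible arrangement: a depthwise 3×3 convolution branch and a multi-head self-attention branch run in parallel and are summed with a residual connection. The function names, the parallel layout, the depthwise kernel, and the random weights are all assumptions made for illustration and should not be read as the authors' design.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, wq, wk, wv, wo, num_heads):
    """Scaled dot-product self-attention over tokens of shape (N, C)."""
    n, c = tokens.shape
    d = c // num_heads  # channels per head; C must be divisible by num_heads
    # Project, then split the channel axis into heads: (heads, N, d).
    q = (tokens @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (tokens @ wk).reshape(n, num_heads, d).transpose(1, 0, 2)
    v = (tokens @ wv).reshape(n, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, c)   # concatenate heads
    return merged @ wo


def depthwise_conv3x3(x, kernel):
    """Zero-padded depthwise 3x3 convolution; x is (H, W, C), kernel is (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def init_params(channels, rng):
    """Random illustrative weights (hypothetical initialisation)."""
    scale = 1.0 / np.sqrt(channels)
    params = {"kernel": rng.standard_normal((3, 3, channels)) * scale}
    for name in ("wq", "wk", "wv", "wo"):
        params[name] = rng.standard_normal((channels, channels)) * scale
    return params


def conv_mha_block(x, params, num_heads=8):
    """Assumed fusion: residual + local (conv) branch + global (attention) branch."""
    h, w, c = x.shape
    local = depthwise_conv3x3(x, params["kernel"])
    tokens = x.reshape(h * w, c)  # flatten the feature map into a token sequence
    glob = multi_head_self_attention(
        tokens, params["wq"], params["wk"], params["wv"], params["wo"], num_heads
    ).reshape(h, w, c)
    return x + local + glob


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 6, 16))  # toy (H, W, C) feature map
output = conv_mha_block(features, init_params(16, rng), num_heads=8)
print(output.shape)
```

The convolution branch only sees a 3×3 neighbourhood, while each attention head mixes information across all spatial positions; summing the two is one simple way to combine local detail with global context.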

Key words: Human Pose Transfer (HPT), image generation, generative adversarial network, multi-head attention, convolution

CLC Number: TP183

Hong YANG, He ZHANG, Shaoning JIN. Human pose transfer model combining convolution and multi-head attention[J]. Journal of Computer Applications, 2023, 43(11): 3403-3410.

Module	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
CoT	0.7721	18.6443	13.3399	0.2133
Transformer	0.7764	19.0078	11.3327	0.1972
Scheme (a)	0.7787	19.0537	11.4487	0.1964
Scheme (b)	0.7790	19.0574	11.3609	0.1954
Scheme (c)	0.7792	19.0765	11.2031	0.1937
Scheme (d)	0.7798	19.0893	11.1359	0.1936
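
The PSNR column is reported in decibels. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10·log10(MAX² / MSE), over flat lists of pixel values. The 8-bit peak value of 255 and the flat-list input are assumptions for illustration; the page does not state which pixel range or evaluation script was used.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly one grey level.
print(round(psnr([10, 20, 30, 40], [11, 21, 31, 41]), 2))
```

Because the scale is logarithmic, the gaps of a few hundredths of a dB between the schemes above correspond to very small differences in mean squared error.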

Number of attention heads	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
1	0.7762	18.9221	11.8695	0.2020
2	0.7781	18.9999	11.4700	0.1961
4	0.7786	19.0350	11.3865	0.1950
8	0.7798	19.0893	11.1359	0.1936
16	0.7795	19.0780	11.0184	0.1938

	DeepFashion				Market-1501
Model	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓	SSIM↑	PSNR/dB↑	FID↓	LPIPS↓
PG2	0.7730	17.5324	49.5674	0.2928	0.2704	14.1749	86.0288	0.3619
PATN	0.7717	18.2543	20.7500	0.2536	0.2818	14.2622	22.6814	0.3194
ADGAN	0.7719	18.3768	14.4833	0.2256	-	-	-	-
DIST	0.7677	18.5737	10.8429	0.2258	0.2808	14.3368	19.7403	0.2815
PISE	0.7682	18.5208	11.5144	0.2080	-	-	-	-
SPIG	0.7758	18.5867	12.7027	0.2102	0.3139	14.4894	23.0573	0.2777
DPTN	0.7782	19.1492	11.4664	0.1957	0.2854	14.5207	18.9946	0.2711
Proposed model	0.7798	19.0893	11.1359	0.1936	0.2877	14.5727	24.6907	0.2758
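
Since the arrows in the header mean different things per column (↑ higher is better, ↓ lower is better), a direct comparison against DPTN is easy to misread. The short sketch below copies the last two rows of the table above and lists the metrics on which the proposed model is strictly better. The dictionary layout and the function name `metrics_won` are hypothetical helpers introduced only for this illustration.

```python
# Direction of each metric, taken from the arrows in the table header.
HIGHER_IS_BETTER = {"SSIM": True, "PSNR": True, "FID": False, "LPIPS": False}

# Values copied from the DPTN and proposed-model rows of the comparison table.
RESULTS = {
    "DeepFashion": {
        "DPTN": {"SSIM": 0.7782, "PSNR": 19.1492, "FID": 11.4664, "LPIPS": 0.1957},
        "Proposed": {"SSIM": 0.7798, "PSNR": 19.0893, "FID": 11.1359, "LPIPS": 0.1936},
    },
    "Market-1501": {
        "DPTN": {"SSIM": 0.2854, "PSNR": 14.5207, "FID": 18.9946, "LPIPS": 0.2711},
        "Proposed": {"SSIM": 0.2877, "PSNR": 14.5727, "FID": 24.6907, "LPIPS": 0.2758},
    },
}


def metrics_won(dataset, model="Proposed", baseline="DPTN"):
    """Return the metrics on which `model` is strictly better than `baseline`."""
    rows = RESULTS[dataset]
    won = []
    for metric, higher in HIGHER_IS_BETTER.items():
        a, b = rows[model][metric], rows[baseline][metric]
        if (a > b) if higher else (a < b):
            won.append(metric)
    return won


for name in RESULTS:
    print(name, metrics_won(name))
```

On DeepFashion the proposed model leads on SSIM, FID and LPIPS, which matches the claim in the abstract, while DPTN keeps a slightly higher PSNR. On Market-1501 the proposed model leads on SSIM and PSNR, and DPTN is better on FID and LPIPS.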