Journal of Computer Applications (official website) ›› 2023, Vol. 43 ›› Issue (11): 3403-3410. DOI: 10.11772/j.issn.1001-9081.2022111707
Special topic: Artificial Intelligence
Hong YANG, He ZHANG, Shaoning JIN
Received: 2022-11-18
Revised: 2022-12-25
Accepted: 2022-12-28
Online: 2023-11-14
Published: 2023-11-10
Contact: Hong YANG (yanghong@dlmu.edu.cn)
About author: YANG Hong, born in 1977 in Huludao, Liaoning, is an associate professor with a Ph.D. Her research interests include data mining and behavior recognition.
Abstract:
Given a reference image of a person, Human Pose Transfer (HPT) aims to generate an image of that person in an arbitrary target pose. Many existing methods still fall short in capturing appearance details and in inferring invisible regions; in particular, under complex pose transformations they struggle to generate clear, realistic appearances. To address these problems, a novel HPT model combining convolution and multi-head attention was proposed. First, a Convolution-Multi-Head Attention (Conv-MHA) block was constructed by fusing convolution with the multi-head attention mechanism to extract rich contextual features. Second, the HPT network was built on Conv-MHA blocks to strengthen the model's learning capability. Finally, self-reconstruction of the reference image was introduced as an auxiliary task to exploit the model's capacity more fully. The Conv-MHA-based HPT model was validated on the DeepFashion and Market-1501 datasets; on the DeepFashion test set it outperforms the state-of-the-art HPT model DPTN (Dual-task Pose Transformer Network) in Structural SIMilarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Fréchet Inception Distance (FID). The experimental results show that the Conv-MHA block, which integrates convolution and multi-head attention, improves the representation ability of the model, captures person appearance details more effectively, and increases the accuracy of person image generation.
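The paper's exact Conv-MHA architecture is not detailed on this page. As a rough, dependency-light illustration of the general pattern it names (a local convolution branch fused with a global multi-head self-attention branch), the following NumPy sketch uses fixed, unlearned weights; the function names, the averaging kernel, and the additive fusion are assumptions for illustration, not the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    """Self-attention over the token axis; x has shape (n_tokens, d).
    Identity Q/K/V projections keep the sketch free of learned weights."""
    n, d = x.shape
    assert d % n_heads == 0, "embedding dim must divide evenly across heads"
    hd = d // n_heads
    out = np.empty_like(x)
    for h in range(n_heads):
        q = k = v = x[:, h * hd:(h + 1) * hd]   # per-head channel slice
        attn = softmax(q @ k.T / np.sqrt(hd))   # (n, n) global attention map
        out[:, h * hd:(h + 1) * hd] = attn @ v
    return out

def conv3x3(fmap, kernel):
    """'same'-padded 3x3 convolution applied per channel; fmap: (H, W, C)."""
    H, W, C = fmap.shape
    p = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(fmap)
    for i in range(H):
        for j in range(W):
            out[i, j] = (p[i:i + 3, j:j + 3] * kernel[..., None]).sum(axis=(0, 1))
    return out

def conv_mha_block(fmap, n_heads=8):
    """Hypothetical Conv-MHA fusion: local conv branch + global
    multi-head self-attention branch, merged by addition."""
    H, W, C = fmap.shape
    kernel = np.full((3, 3), 1.0 / 9.0)          # fixed averaging kernel stands in for learned weights
    local = conv3x3(fmap, kernel)
    tokens = fmap.reshape(H * W, C)              # flatten the spatial grid into tokens
    global_ = multi_head_attention(tokens, n_heads).reshape(H, W, C)
    return local + global_

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = conv_mha_block(x, n_heads=4)
print(y.shape)  # (8, 8, 16)
```

The conv branch only mixes a 3×3 neighborhood, while the attention branch lets every spatial position attend to every other; summing the two is one simple way such local and global context could be combined.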
Hong YANG, He ZHANG, Shaoning JIN. Human pose transfer model combining convolution and multi-head attention[J]. Journal of Computer Applications, 2023, 43(11): 3403-3410.
Module | SSIM↑ | PSNR/dB↑ | FID↓ | LPIPS↓
---|---|---|---|---
CoT | 0.772 1 | 18.644 3 | 13.339 9 | 0.213 3
Transformer | 0.776 4 | 19.007 8 | 11.332 7 | 0.197 2
Scheme (a) | 0.778 7 | 19.053 7 | 11.448 7 | 0.196 4
Scheme (b) | 0.779 0 | 19.057 4 | 11.360 9 | 0.195 4
Scheme (c) | — | — | — | —
Scheme (d) | 0.779 8 | 19.089 3 | 11.135 9 | 0.193 6

Tab. 1 Quantitative evaluation of different blocks
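The PSNR column above is in decibels: PSNR = 10·log10(peak²/MSE), with peak = 255 for 8-bit images. A minimal sketch (the function name `psnr` is ours, not from the paper):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 16.0)     # constant error of 16 -> MSE = 256
print(round(psnr(a, b), 2))   # 10*log10(255^2/256) ≈ 24.05
```

Higher PSNR means the generated image deviates less, pixel-wise, from the ground truth, which is why the arrow in the header points up.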
Number of attention heads | SSIM↑ | PSNR/dB↑ | FID↓ | LPIPS↓
---|---|---|---|---
1 | 0.776 2 | 18.922 1 | 11.869 5 | 0.202 0
2 | 0.778 1 | 18.999 9 | 11.470 0 | 0.196 1
4 | 0.778 6 | 19.035 0 | 11.386 5 | 0.195 0
8 | 0.779 8 | 19.089 3 | 11.135 9 | 0.193 6
16 | — | — | 11.018 4 | —

Tab. 2 Quantitative evaluation of number of attention heads
Model | SSIM↑ (DeepFashion) | PSNR/dB↑ (DeepFashion) | FID↓ (DeepFashion) | LPIPS↓ (DeepFashion) | SSIM↑ (Market-1501) | PSNR/dB↑ (Market-1501) | FID↓ (Market-1501) | LPIPS↓ (Market-1501)
---|---|---|---|---|---|---|---|---
PG2 | 0.773 0 | 17.532 4 | 49.567 4 | 0.292 8 | 0.270 4 | 14.174 9 | 86.028 8 | 0.361 9
PATN | 0.771 7 | 18.254 3 | 20.750 0 | 0.253 6 | 0.281 8 | 14.262 2 | 22.681 4 | 0.319 4
ADGAN | 0.771 9 | 18.376 8 | 14.483 3 | 0.225 6 | — | — | — | —
DIST | 0.767 7 | 18.573 7 | 10.842 9 | 0.225 8 | 0.280 8 | 14.336 8 | — | 0.281 5
PISE | 0.768 2 | 18.520 8 | 11.514 4 | 0.208 0 | — | — | — | —
SPIG | 0.775 8 | 18.586 7 | 12.702 7 | 0.210 2 | 0.313 9 | 14.489 4 | 23.057 3 | 0.277 7
DPTN | — | 19.149 2 | 11.466 4 | — | 0.285 4 | — | 18.994 6 | 0.271 1
Proposed model | 0.779 8 | 19.089 3 | 11.135 9 | 0.193 6 | — | 14.572 7 | 24.690 7 | —

Tab. 3 Comparison of results of different models
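The FID columns in Table 3 follow the standard definition: with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ the mean and covariance of Inception features computed over real and generated images respectively, the Fréchet distance between the two Gaussian fits is

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

so lower values indicate that the generated feature distribution lies closer to the real one.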
1 | GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems — Volume 2. Cambridge: MIT Press, 2014: 2672-2680. |
2 | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. (2022-12-10) [2023-03-17]. |
3 | MA L, JIA X, SUN Q, et al. Pose guided person image generation[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 405-415. |
4 | ESSER P, SUTTER E. A variational U-Net for conditional appearance and shape generation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8857-8866. 10.1109/cvpr.2018.00923 |
5 | LI Y, HUANG C, LOY C C. Dense intrinsic appearance flow for human pose transfer[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3688-3697. 10.1109/cvpr.2019.00381 |
6 | REN Y, YU X, CHEN J, et al. Deep image spatial transformation for person image generation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 7687-7696. 10.1109/cvpr42600.2020.00771 |
7 | LV Z, LI X, LI X, et al. Learning semantic person image generation by region-adaptive normalization[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 10801-10810. 10.1109/cvpr46437.2021.01066 |
8 | ZHANG J, LI K, LAI Y K, et al. PISE: person image synthesis and editing with decoupled GAN[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 7978-7986. 10.1109/cvpr46437.2021.00789 |
9 | TANG H, BAI S, ZHANG L, et al. XingGAN for person image generation[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12370. Cham: Springer, 2020: 717-734. |
10 | ZHU Z, HUANG T, SHI B, et al. Progressive pose attention transfer for person image generation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 2342-2351. 10.1109/cvpr.2019.00245 |
11 | ZHANG P, YANG L, LAI J, et al. Exploring dual-task correlation for pose guided person image generation[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 7703-7712. 10.1109/cvpr52688.2022.00756 |
12 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
13 | HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. 10.1109/cvpr.2018.00745 |
14 | LI X, WANG W, HU X, et al. Selective kernel networks[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 510-519. 10.1109/cvpr.2019.00060 |
15 | WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11211. Cham: Springer, 2018: 3-19. |
16 | SRINIVAS A, LIN T Y, PARMAR N, et al. Bottleneck Transformers for visual recognition[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16514-16524. 10.1109/cvpr46437.2021.01625 |
17 | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-06-17]. |
18 | LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002. 10.1109/iccv48922.2021.00986 |
19 | DONG X, BAO J, CHEN D, et al. CSWin Transformer: a general vision Transformer backbone with cross-shaped windows[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 12114-12124. 10.1109/cvpr52688.2022.01181 |
20 | VASWANI A, RAMACHANDRAN P, SRINIVAS A, et al. Scaling local self-attention for parameter efficient visual backbones[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 12889-12899. 10.1109/cvpr46437.2021.01270 |
21 | LI Y, YAO T, PAN Y, et al. Contextual Transformer networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(2): 1489-1500. |
22 | DAI Z, LIU H, LE Q V, et al. CoAtNet: marrying convolution and attention for all data sizes[C]// Proceedings of the 35th Conference on Neural Information Processing Systems, 2021. |
23 | PAN X, GE C, LU R, et al. On the integration of self-attention and convolution[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 805-815. 10.1109/cvpr52688.2022.00089 |
24 | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[C]// Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention, LNCS 9351. Cham: Springer, 2015: 234-241. |
25 | JIANG Y, CHANG S, WANG Z. TransGAN: two pure transformers can make one strong GAN, and that can scale up[C]// Proceedings of the 35th Conference on Neural Information Processing Systems, 2021. |
26 | HUDSON D A, ZITNICK C L. Generative adversarial transformers[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 4487-4499. |
27 | JOHNSON J, ALAHI A, LI F F. Perceptual losses for real-time style transfer and super-resolution[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9906. Cham: Springer, 2016: 694-711. |
28 | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10) [2022-06-17]. |
29 | ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 5967-5976. 10.1109/cvpr.2017.632 |
30 | GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 5769-5779. |
31 | LIU Z, LUO P, QIU S, et al. DeepFashion: powering robust clothes recognition and retrieval with rich annotations[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1096-1104. 10.1109/cvpr.2016.124 |
32 | ZHENG L, SHEN L, TIAN L, et al. Scalable person re-identification: a benchmark[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1116-1124. 10.1109/iccv.2015.133 |
33 | CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1302-1310. 10.1109/cvpr.2017.143 |
34 | WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612. 10.1109/tip.2003.819861 |
35 | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6629-6640. 10.48550/arXiv.1706.08500 |
36 | ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595. 10.1109/cvpr.2018.00068 |
37 | KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2022-06-17]. |
38 | MEN Y, MAO Y, JIANG Y, et al. Controllable person image synthesis with attribute-decomposed GAN[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 5083-5092. 10.1109/cvpr42600.2020.00513 |