Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (11): 3403-3410. DOI: 10.11772/j.issn.1001-9081.2022111707

• Artificial Intelligence •

Human pose transfer model combining convolution and multi-head attention

Hong YANG, He ZHANG, Shaoning JIN

  1. Information Science and Technology College, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Received: 2022-11-18  Revised: 2022-12-25  Accepted: 2022-12-28  Online: 2023-11-14  Published: 2023-11-10
  • Contact: Hong YANG (yanghong@dlmu.edu.cn)
  • About the authors: YANG Hong, born in 1977 in Huludao, Liaoning, Ph. D., associate professor. Her research interests include data mining and behavior recognition.
    ZHANG He, born in 1998 in Linyi, Shandong, M. S. candidate. His research interests include image generation and deep generative models.
    JIN Shaoning, born in 1996 in Jingning, Gansu, M. S. candidate. Her research interests include gait recognition and artificial intelligence.


Abstract:

For a given reference image of a person, the goal of Human Pose Transfer (HPT) is to generate an image of that person in an arbitrary target pose. Many existing methods fall short in capturing the details of a person's appearance and in inferring invisible regions, and for complex pose transformations in particular they struggle to generate a clear and realistic appearance. To address these problems, a novel HPT model combining convolution and multi-head attention was proposed. Firstly, a Convolution-Multi-Head Attention (Conv-MHA) block was constructed by fusing convolution with the multi-head attention mechanism, and was used to extract rich contextual features. Secondly, an HPT network was built from Conv-MHA blocks to improve the learning ability of the proposed model. Finally, self-reconstruction of the reference image was introduced as an auxiliary task to exploit the capability of the model more fully. The Conv-MHA-based HPT model was validated on the DeepFashion and Market-1501 datasets. The results show that it outperforms the state-of-the-art HPT model DPTN (Dual-task Pose Transformer Network) on the DeepFashion test set in terms of Structural SIMilarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Fréchet Inception Distance (FID). Experimental results show that the Conv-MHA block, which integrates convolution and the multi-head attention mechanism, improves the representation ability of the model, captures the details of a person's appearance more effectively, and increases the accuracy of person image generation.
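The abstract's core idea — fusing a local convolutional branch with a global multi-head attention branch inside one block — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's actual Conv-MHA design: the identity Q/K/V projections, the fixed smoothing kernel, and the residual-sum fusion are all assumptions made for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    """Global branch: self-attention over a feature sequence x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        q = k = v = x[:, sl]                       # identity projections for brevity (assumption)
        attn = softmax(q @ k.T / np.sqrt(head_dim))
        out[:, sl] = attn @ v
    return out

def depthwise_conv1d(x, kernel):
    """Local branch: per-channel 1-D convolution with 'same' padding; x: (seq_len, dim)."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([kernel @ xp[i:i + k] for i in range(x.shape[0])])

def conv_mha_block(x, num_heads=4):
    """Fuse the local convolutional branch with the global attention branch (residual sum assumed)."""
    local_feat = depthwise_conv1d(x, np.array([0.25, 0.5, 0.25]))
    global_feat = multi_head_attention(x, num_heads)
    return x + local_feat + global_feat

x = np.random.default_rng(0).standard_normal((16, 32))
y = conv_mha_block(x)
print(y.shape)
```

Stacking several such blocks gives a network in which every layer sees both short-range texture (convolution) and long-range dependencies across body parts (attention), which is the motivation the abstract gives for the fusion.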

Key words: Human Pose Transfer (HPT), image generation, generative adversarial network, multi-head attention, convolution
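The auxiliary self-reconstruction task mentioned in the abstract amounts to a dual-task training objective: the generator both transfers the reference image to the target pose and reconstructs the reference image itself. The sketch below is an assumption-laden illustration — the L1 losses and the weight `lam` are not specified in this abstract.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two images (hypothetical choice of loss)."""
    return np.abs(pred - target).mean()

def dual_task_loss(generated, target, reconstructed, reference, lam=0.5):
    transfer = l1_loss(generated, target)        # main pose-transfer term
    recon = l1_loss(reconstructed, reference)    # auxiliary self-reconstruction term
    return transfer + lam * recon                # weighting `lam` is an assumption

rng = np.random.default_rng(1)
ref = rng.random((8, 8, 3))
loss = dual_task_loss(rng.random((8, 8, 3)), rng.random((8, 8, 3)), ref.copy(), ref)
print(loss)
```

A perfect self-reconstruction drives the auxiliary term to zero, so the extra task regularizes training without competing with the main transfer objective at convergence.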

CLC number: