Human-centric detail-enhanced virtual try-on method

doi:10.11772/j.issn.1001-9081.2025040475

Abstract

Current virtual try-on methods do not adequately preserve the local details of target garments. In addition, when a diffusion model is used for generation, the Variational AutoEncoder (VAE) maps the input data to a low-dimensional space, which causes a loss of high-frequency detail in the fashion model's hands and face. To address these two problems, a human-centric detail-enhanced virtual try-on method is proposed. Firstly, the clothing-agnostic human body map, the human pose map, and the target garment are input into a Geometric Matching Module (GMM) to generate a coarsely warped garment. Secondly, a Garment Warp Refinement (GWR) module is constructed to enhance the detailed features of the coarsely warped garment. Thirdly, the warped garment map, the clothing-agnostic human body map, and the human pose map are concatenated and fed into a UNet together with textual features; the textual and image features are fused, and a clear image is generated progressively through denoising. Fourthly, a Mask Feature Connection (MFC) module with coordinate attention is constructed to localize the fashion model more accurately and to preserve the high-frequency detail in the hands and face, which keeps the results human-centric. Finally, the outputs of the MFC module and the UNet are fused and decoded to obtain the final try-on result. Experimental results show that, compared with the LADI-VTON (LAtent DIffusion-Virtual Try-ON) method on the Dress Code dataset, the proposed method improves the Structural Similarity Index Measure (SSIM) by 1.41% and reduces Learned Perceptual Image Patch Similarity (LPIPS), Fréchet Inception Distance (FID), and Kernel Inception Distance (KID) by 7.32%, 31.03%, and 64.56%, respectively, indicating better virtual try-on performance.

Keywords: virtual try-on, detail enhancement, coordinate attention mechanism, human-centric, diffusion model

CLC Number: TP391.41

Peirong SHAO, Suzhen LIN, Yanbo WANG. Human-centric detail-enhanced virtual try-on method[J]. Journal of Computer Applications, 2026, 46(3): 915-923.

Layer	Network structure	Output size (C, H, W)
input	-	(24, 512, 384)
inputConv	DoubleConv	(64, 512, 384)
down1	MaxPool2d, DoubleConv, DEABlock	(128, 256, 192)
down2	MaxPool2d, DoubleConv, DEABlock	(256, 128, 96)
down3	MaxPool2d, DoubleConv, DEABlock	(512, 64, 48)
down4	MaxPool2d, DoubleConv, DEABlock	(512, 32, 24)
up1	Upsample, DoubleConv	(256, 64, 48)
up2	Upsample, DoubleConv	(128, 128, 96)
up3	Upsample, DoubleConv	(64, 256, 192)
up4	Upsample, DoubleConv	(64, 512, 384)
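
The output sizes in the table above can be checked with a small shape trace. The sketch below is only bookkeeping, not the authors' network: it assumes that each pooling step halves the height and width, that each upsampling step doubles them, and that DoubleConv and DEABlock change only the channel count. The function name `trace_shapes` and its parameters are hypothetical.

```python
# Shape trace for the encoder-decoder table above (illustrative only).
# Assumptions not stated in the source: pooling halves H and W,
# upsampling doubles them, and the conv blocks only change channels.

def trace_shapes(in_shape, stem_channels, enc_channels, dec_channels):
    """Return a list of (layer_name, (C, H, W)) tuples."""
    c, h, w = in_shape
    shapes = [("input", (c, h, w))]
    shapes.append(("inputConv", (stem_channels, h, w)))
    for i, c_out in enumerate(enc_channels, start=1):
        h, w = h // 2, w // 2          # assumed 2x2 max pooling
        shapes.append((f"down{i}", (c_out, h, w)))
    for i, c_out in enumerate(dec_channels, start=1):
        h, w = h * 2, w * 2            # assumed scale factor of 2
        shapes.append((f"up{i}", (c_out, h, w)))
    return shapes

# Channel plan read directly from the table.
SHAPES = trace_shapes(
    in_shape=(24, 512, 384),
    stem_channels=64,
    enc_channels=[128, 256, 512, 512],
    dec_channels=[256, 128, 64, 64],
)

for name, shape in SHAPES:
    print(f"{name}\t{shape}")
```

Under these assumptions the trace reproduces every row of the table, ending at (64, 512, 384), the same spatial resolution as the input.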

Method	FID	KID	SSIM	LPIPS
HR-VTON	12.20	3.79	0.8139	0.2028
GP-VTON	9.66	1.58	0.8208	0.2161
LADI-VTON	9.41	1.60	0.8149	0.2026
IDM-VTON	9.26	1.40	0.8081	0.2101
OOTDiffusion	9.59	1.53	0.7986	0.2152
Proposed method	9.05	1.42	0.8237	0.1972

Method	Upper-body FID	Upper-body KID	Lower-body FID	Lower-body KID	Dresses FID	Dresses KID	All-categories FID	All-categories KID	All-categories SSIM	All-categories LPIPS
GP-VTON	12.46	1.52	16.92	3.29	13.25	2.27	6.11	1.55	0.7539	0.3002
LADI-VTON	13.96	3.02	14.62	3.03	14.61	3.32	6.96	2.37	0.8711	0.1339
IDM-VTON	11.90	1.42	17.76	6.06	14.04	3.38	6.95	2.74	0.8623	0.1521
OOTDiffusion	12.28	1.56	16.48	4.62	14.68	3.42	6.23	2.97	0.8499	0.1382
Proposed method	10.81	0.77	12.40	1.82	11.70	1.40	4.80	0.84	0.8834	0.1241
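
The percentages quoted in the abstract appear to be relative changes computed from the all-categories columns of the table above, taking the LADI-VTON row as the baseline and the last row as the proposed method. A minimal sketch of that arithmetic follows; the helper name `relative_change` is hypothetical, and the numbers are copied from those two rows.

```python
# Relative change of the proposed method against the LADI-VTON row,
# using the all-categories columns of the table above.

def relative_change(baseline, ours):
    """Signed relative change in percent: positive means an increase."""
    return (ours - baseline) / baseline * 100.0

BASELINE = {"SSIM": 0.8711, "LPIPS": 0.1339, "FID": 6.96, "KID": 2.37}
OURS = {"SSIM": 0.8834, "LPIPS": 0.1241, "FID": 4.80, "KID": 0.84}

CHANGES = {
    metric: round(relative_change(BASELINE[metric], OURS[metric]), 2)
    for metric in BASELINE
}

for metric, pct in CHANGES.items():
    print(f"{metric}: {pct:+.2f}%")
```

The results are +1.41% for SSIM and -7.32%, -31.03%, and -64.56% for LPIPS, FID, and KID, which agree with the figures stated in the abstract. Higher is better for SSIM; lower is better for the other three metrics.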