WildVidFit: Video Virtual Try-on in the Wild via Image-Based Controlled Diffusion Models

Abstract

Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person’s pose and body shape in source videos. Traditional image-based methods, which rely on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models in a streamlined, one-stage approach. The model, conditioned on a specific garment and person, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models, including a video masked autoencoder that improves segment smoothness and a self-supervised model that aligns features of adjacent frames in the latent space. This integration markedly boosts the model’s ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit’s capability to generate fluid and coherent videos. The project page is at wildvidfit-project.github.io.

Cite

Text

He et al. "WildVidFit: Video Virtual Try-on in the Wild via Image-Based Controlled Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72643-9_8

Markdown

[He et al. "WildVidFit: Video Virtual Try-on in the Wild via Image-Based Controlled Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/he2024eccv-wildvidfit/) doi:10.1007/978-3-031-72643-9_8

BibTeX

@inproceedings{he2024eccv-wildvidfit,
  title     = {{WildVidFit: Video Virtual Try-on in the Wild via Image-Based Controlled Diffusion Models}},
  author    = {He, Zijian and Chen, Peixin and Wang, Guangrun and Li, Guanbin and Torr, Philip and Lin, Liang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72643-9_8},
  url       = {https://mlanthology.org/eccv/2024/he2024eccv-wildvidfit/}
}