FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Abstract

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. These tasks differ drastically in their input/output formats and dataset sizes. The common practice has been to design a task-specific model for each task and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and an inability to exploit inter-task relatedness. To address these issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL). Compared with existing approaches, FAME-ViL applies a single model to multiple heterogeneous fashion tasks and is therefore much more parameter-efficient. This is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL.
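
The parameter-efficiency argument above can be illustrated with a rough back-of-the-envelope sketch. It compares storing one fully fine-tuned model per task against one shared backbone plus a small adapter set per task. All sizes and function names below are hypothetical and chosen only for illustration; they are not taken from the paper, the sketch ignores the cross-attention adapters, and it does not reproduce the reported 61.5% figure.

```python
# Hypothetical sizes for illustration only (not from the paper).
BACKBONE_PARAMS = 100_000_000  # one pre-trained V+L backbone
ADAPTER_PARAMS = 5_000_000     # one task-specific adapter set (assumed size)

# The four fashion tasks named in the abstract.
TASKS = [
    "cross-modal retrieval",
    "text-guided image retrieval",
    "multi-modal classification",
    "image captioning",
]


def independent_total(n_tasks: int, backbone: int) -> int:
    """Each task fine-tunes and stores its own full copy of the backbone."""
    return n_tasks * backbone


def shared_total(n_tasks: int, backbone: int, adapter: int) -> int:
    """One shared backbone plus one small adapter set per task."""
    return backbone + n_tasks * adapter


def saving_fraction(n_tasks: int, backbone: int, adapter: int) -> float:
    """Fraction of parameters saved by sharing, relative to independent models."""
    independent = independent_total(n_tasks, backbone)
    shared = shared_total(n_tasks, backbone, adapter)
    return 1 - shared / independent


if __name__ == "__main__":
    n = len(TASKS)
    print("independent:", independent_total(n, BACKBONE_PARAMS))
    print("shared:     ", shared_total(n, BACKBONE_PARAMS, ADAPTER_PARAMS))
    print("saving:     ", round(saving_fraction(n, BACKBONE_PARAMS, ADAPTER_PARAMS), 3))
```

Under these assumed sizes, the saving grows with the number of tasks, because the shared backbone is stored once while only the small per-task adapters are duplicated.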

Cite

Text

Han et al. "FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00262

Markdown

[Han et al. "FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/han2023cvpr-famevil/) doi:10.1109/CVPR52729.2023.00262

BibTeX

@inproceedings{han2023cvpr-famevil,
  title     = {{FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks}},
  author    = {Han, Xiao and Zhu, Xiatian and Yu, Licheng and Zhang, Li and Song, Yi-Zhe and Xiang, Tao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {2669-2680},
  doi       = {10.1109/CVPR52729.2023.00262},
  url       = {https://mlanthology.org/cvpr/2023/han2023cvpr-famevil/}
}