MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models

Abstract

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
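The abstract describes two core ideas: a small transformer that non-linearly combines several loss values into one unified training loss, and a bi-level (meta) optimization, solved with Approximate Implicit Differentiation, that learns this combiner. The PyTorch sketch below illustrates only the first idea under assumed names and hyperparameters (LossTransformer, dim, num_heads are illustrative); it is not the authors' implementation (see the linked repository) and it omits the bi-level/AID optimization entirely.

import torch
import torch.nn as nn

class LossTransformer(nn.Module):
    """Hypothetical sketch of a non-linear loss combiner.

    Each scalar loss value becomes one token; a tiny transformer encoder
    mixes the tokens and a linear head outputs a single combined loss.
    Names and sizes are assumptions for illustration only.
    """

    def __init__(self, num_losses: int, dim: int = 32, num_heads: int = 4):
        super().__init__()
        # Project each scalar loss to an embedding and add a learnable
        # per-loss embedding so the module knows which loss each token is.
        self.value_proj = nn.Linear(1, dim)
        self.loss_embed = nn.Parameter(torch.randn(num_losses, dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.head = nn.Linear(dim, 1)  # pooled tokens -> one scalar loss

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        # losses: tensor of shape (num_losses,) holding the individual loss values.
        tokens = self.value_proj(losses.unsqueeze(-1)) + self.loss_embed  # (L, dim)
        tokens = self.encoder(tokens.unsqueeze(0))                        # (1, L, dim)
        return self.head(tokens.mean(dim=1)).squeeze()                    # scalar


if __name__ == "__main__":
    # Toy usage: combine three auxiliary loss values into one unified loss.
    combiner = LossTransformer(num_losses=3)
    aux_losses = torch.tensor([0.7, 1.2, 0.4])  # e.g. retrieval, captioning, MLM losses
    unified = combiner(aux_losses)
    unified.backward()  # gradients reach the combiner's parameters
    print(unified.item())

In the full framework, the combiner's parameters would not be trained jointly with the model on the same objective; the paper's bi-level formulation instead updates them against the primary task loss, with AID used to approximate the implicit gradient through the inner fine-tuning step.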

Cite

Text

Ko et al. "MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01925

Markdown

[Ko et al. "MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/ko2023cvpr-meltr/) doi:10.1109/CVPR52729.2023.01925

BibTeX

@inproceedings{ko2023cvpr-meltr,
  title     = {{MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models}},
  author    = {Ko, Dohwan and Choi, Joonmyung and Choi, Hyeong Kyu and On, Kyoung-Woon and Roh, Byungseok and Kim, Hyunwoo J.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {20105--20115},
  doi       = {10.1109/CVPR52729.2023.01925},
  url       = {https://mlanthology.org/cvpr/2023/ko2023cvpr-meltr/}
}