Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning

Abstract

Large language models (LLMs) have achieved great success in natural language processing and show significant potential for multimodal applications. Despite their surprising zero-shot and few-shot abilities, pre-trained language models still need to be fine-tuned effectively for specific downstream tasks. In this paper, we introduce CaptionT5, a video captioning model that fine-tunes T5 to understand videos and generate descriptive captions. To generate captions that correspond more closely to the video, CaptionT5 introduces thought-augmented fine-tuning for video captioning, in which a pre-trained language model is fine-tuned on thought-augmented video inputs. This resembles the process by which humans see a video, think of visual concepts such as objects and actions, and then tell a correct and natural sentence based on those thoughts. To generate thoughts automatically, we propose (1) CLIP-guided thought sampling, which samples thoughts based on similarity in an image-text multimodal embedding space by leveraging CLIP. We also propose (2) CLIP-guided caption ranking during decoding for further performance gains. Through experiments on the VATEX, MSRVTT, and YC2 datasets, we empirically demonstrate that CaptionT5 performs competitively against prior state-of-the-art video captioning approaches without using encoders specialized for video data. Further experiments show that CaptionT5 is especially effective when only a small number of video frames is sampled.
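The abstract describes two CLIP-based components: sampling "thoughts" (visual concepts) by image-text similarity, and ranking decoded captions against the video. The Python sketch below is a minimal, hedged illustration of how such scoring could be implemented with the Hugging Face CLIP API; the concept vocabulary, the functions sample_thoughts and rank_captions, the checkpoint name, and the top-k size are illustrative assumptions, not the paper's actual implementation.

# Sketch of CLIP-guided thought sampling and caption ranking (assumptions noted above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_frames(frames):
    """Return L2-normalized CLIP image embeddings for a list of PIL frames."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def sample_thoughts(frames, concept_vocab, k=5):
    """Thought sampling (sketch): pick the k concept words whose CLIP text
    embeddings are most similar to the mean embedding of the sampled frames."""
    video_emb = embed_frames(frames).mean(dim=0, keepdim=True)   # (1, d)
    concept_emb = embed_texts(concept_vocab)                     # (V, d)
    sims = (video_emb @ concept_emb.T).squeeze(0)                # (V,)
    top = sims.topk(k).indices.tolist()
    return [concept_vocab[i] for i in top]

def rank_captions(frames, candidate_captions):
    """Caption ranking (sketch): score decoded candidates by CLIP similarity
    to the video and return them best-first."""
    video_emb = embed_frames(frames).mean(dim=0, keepdim=True)
    cap_emb = embed_texts(candidate_captions)
    sims = (video_emb @ cap_emb.T).squeeze(0)
    order = sims.argsort(descending=True).tolist()
    return [candidate_captions[i] for i in order]

if __name__ == "__main__":
    # Hypothetical inputs: a few sampled frames and a small concept vocabulary.
    frames = [Image.open(f"frame_{i}.jpg") for i in range(4)]
    vocab = ["a dog", "running", "a beach", "cooking", "a guitar"]
    thoughts = sample_thoughts(frames, vocab, k=2)
    # In the paper's setup, thoughts augment the input to the fine-tuned T5;
    # here we only rank hypothetical decoded candidates for illustration.
    candidates = ["a dog runs along the beach", "a man plays the guitar"]
    print(thoughts, rank_captions(frames, candidates))

Under these assumptions, the sampled thoughts would be serialized into the T5 input alongside the frame-derived features during fine-tuning, and the ranking step would reorder beam or sampled outputs at decoding time.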

Cite

Text

Kim et al. "Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00187

Markdown

[Kim et al. "Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/kim2024cvprw-show/) doi:10.1109/CVPRW63382.2024.00187

BibTeX

@inproceedings{kim2024cvprw-show,
  title     = {{Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning}},
  author    = {Kim, Byoungjip and Hwang, Dasol and Cho, Sungjun and Jang, Youngsoo and Lee, Honglak and Lee, Moontae},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {1808--1817},
  doi       = {10.1109/CVPRW63382.2024.00187},
  url       = {https://mlanthology.org/cvprw/2024/kim2024cvprw-show/}
}