CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Somepalli, Gowthami; Chowdhury, Arkabandhu; Basri, Ronen; Geiping, Jonas; Goldstein, Tom; Jacobs, David

doi:10.52202/079017-2952

NeurIPS 2024

Abstract

The recent emergence of powerful vision-language models (VLMs) has significantly improved image captioning, and some of these models have been extended to caption videos as well. However, their ability to understand complex scenes is limited, and the descriptions they produce tend to be overly verbose and focused on the superficial appearance of objects. Unlike general-purpose video captioning, scene description, especially in movies, requires deeper contextual understanding. To address this challenge, we propose CALVIN, a specialized video LLM that leverages previous movie context to generate fully "contextual" scene descriptions. To achieve this, we train the model on a suite of tasks that integrate image-based question answering and video captioning within a unified framework, and then apply instruction tuning to refine its ability to produce scene captions. Lastly, we observe that the model responds well to prompt engineering and few-shot in-context learning, enabling users to adapt it to any new movie with very little additional annotation.

Cite

Text

Somepalli et al. "CALVIN: Improved Contextual Video Captioning via Instruction Tuning." Neural Information Processing Systems, 2024. doi:10.52202/079017-2952

Markdown

[Somepalli et al. "CALVIN: Improved Contextual Video Captioning via Instruction Tuning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/somepalli2024neurips-calvin/) doi:10.52202/079017-2952

BibTeX

@inproceedings{somepalli2024neurips-calvin,
  title     = {{CALVIN: Improved Contextual Video Captioning via Instruction Tuning}},
  author    = {Somepalli, Gowthami and Chowdhury, Arkabandhu and Basri, Ronen and Geiping, Jonas and Goldstein, Tom and Jacobs, David},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2952},
  url       = {https://mlanthology.org/neurips/2024/somepalli2024neurips-calvin/}
}