Text-Conditioned Resampler for Long Form Video Understanding

Abstract

In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
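
To make the architecture concrete, below is a minimal PyTorch sketch of a Perceiver-style text-conditioned resampler: learned latent queries are concatenated with embedded text-condition tokens, the joint sequence cross-attends to frozen visual features from all frames, and only the latent outputs are passed on to the LLM. All class names, dimensions, and the exact conditioning scheme here are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn


class TCRBlock(nn.Module):
    """One resampler block: tokens cross-attend to video features,
    then self-attend and pass through an MLP (pre-norm residuals)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_sa = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, video):
        # Cross-attention: queries are the (text + latent) tokens,
        # keys/values are the full set of frame features.
        q, kv = self.ln_q(x), self.ln_kv(video)
        x = x + self.cross_attn(q, kv, kv, need_weights=False)[0]
        s = self.ln_sa(x)
        x = x + self.self_attn(s, s, s, need_weights=False)[0]
        return x + self.mlp(self.ln_mlp(x))


class TextConditionedResampler(nn.Module):
    """Illustrative sketch; hyperparameters are assumptions, not the paper's."""

    def __init__(self, dim=768, num_latents=64, depth=4, num_heads=12):
        super().__init__()
        # Learned latent queries: the video is "resampled" into this
        # fixed-size token set no matter how many frames come in.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.blocks = nn.ModuleList(
            [TCRBlock(dim, num_heads) for _ in range(depth)]
        )

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T * P, dim) frozen features, T frames x P patches
        # text_tokens:  (B, L, dim)     embedded text condition (e.g. the question)
        B, L = text_tokens.shape[:2]
        x = torch.cat([text_tokens, self.latents.expand(B, -1, -1)], dim=1)
        for blk in self.blocks:
            x = blk(x, video_tokens)
        # Keep only the latent positions: a fixed-size, task-aware visual
        # summary that is handed to the frozen LLM.
        return x[:, L:]


# Hypothetical usage with 128 frames of 16 patch tokens each:
resampler = TextConditionedResampler()
video = torch.randn(2, 128 * 16, 768)  # (batch, frames x patches, dim)
text = torch.randn(2, 12, 768)         # embedded text condition
summary = resampler(video, text)       # -> (2, 64, 768) tokens for the LLM

Because the long frame sequence only ever appears as keys/values of cross-attention, the attention cost grows linearly with the number of frames rather than quadratically, which is the property that lets such a module handle 100+ frames with plain attention.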

Cite

Text

Korbar et al. "Text-Conditioned Resampler for Long Form Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73016-0_16

Markdown

[Korbar et al. "Text-Conditioned Resampler for Long Form Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/korbar2024eccv-textconditioned/) doi:10.1007/978-3-031-73016-0_16

BibTeX

@inproceedings{korbar2024eccv-textconditioned,
  title     = {{Text-Conditioned Resampler for Long Form Video Understanding}},
  author    = {Korbar, Bruno and Xian, Yongqin and Tonioni, Alessio and Zisserman, Andrew and Tombari, Federico},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73016-0_16},
  url       = {https://mlanthology.org/eccv/2024/korbar2024eccv-textconditioned/}
}