Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Abstract
Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction before proceeding to provide the desired response. This design of the retrieval mechanism enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate the retrieval process, we developed MiniGPT4-Video, which generates detailed descriptions for the video clips. In addressing the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Goldfish also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5%, and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate significant improvements in both long and short video understanding. Our models and code are publicly available on the Goldfish project page.
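The retrieval step described above, gathering the top-k clips most relevant to the instruction before answering, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: it assumes clip descriptions (e.g. those produced by the captioning model) have already been embedded into vectors, and the function names and embedding format are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_clips(query_emb, clip_embs, k=3):
    """Return indices of the k clips whose description embeddings are
    most similar to the instruction embedding (illustrative sketch)."""
    ranked = sorted(range(len(clip_embs)),
                    key=lambda i: cosine(query_emb, clip_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: the first and third clips point in roughly the same
# direction as the query, so they are retrieved.
indices = top_k_clips([1.0, 0.0],
                      [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
                      k=2)
print(indices)  # → [0, 2]
```

Only the retrieved clips (rather than the whole video) would then be passed to the answering model, which is what keeps memory and computation bounded for arbitrarily long inputs.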
Cite
Text
Ataallah et al. "Goldfish: Vision-Language Understanding of Arbitrarily Long Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73397-0_15
Markdown
[Ataallah et al. "Goldfish: Vision-Language Understanding of Arbitrarily Long Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ataallah2024eccv-goldfish/) doi:10.1007/978-3-031-73397-0_15
BibTeX
@inproceedings{ataallah2024eccv-goldfish,
title = {{Goldfish: Vision-Language Understanding of Arbitrarily Long Videos}},
author = {Ataallah, Kirolos and Shen, Xiaoqian and Abdelrahman, Eslam Mohamed and Sleiman, Essam and Zhuge, Mingchen and Ding, Jian and Zhu, Deyao and Schmidhuber, Jürgen and Elhoseiny, Mohamed},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73397-0_15},
url = {https://mlanthology.org/eccv/2024/ataallah2024eccv-goldfish/}
}