Scaling up Video Summarization Pretraining with Large Language Models

Argaw, Dawit Mureja; Yoon, Seunghyun; Heilbron, Fabian Caba; Deilamsalehy, Hanieh; Bui, Trung; Wang, Zhaowen; Dernoncourt, Franck; Chung, Joon Son

doi:10.1109/CVPR52733.2024.00796

CVPR 2024 pp. 8332-8341

Abstract

Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in their size, constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks.

Cite

Text

Argaw et al. "Scaling up Video Summarization Pretraining with Large Language Models." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00796

Markdown

[Argaw et al. "Scaling up Video Summarization Pretraining with Large Language Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/argaw2024cvpr-scaling/) doi:10.1109/CVPR52733.2024.00796

BibTeX

@inproceedings{argaw2024cvpr-scaling,
  title     = {{Scaling up Video Summarization Pretraining with Large Language Models}},
  author    = {Argaw, Dawit Mureja and Yoon, Seunghyun and Heilbron, Fabian Caba and Deilamsalehy, Hanieh and Bui, Trung and Wang, Zhaowen and Dernoncourt, Franck and Chung, Joon Son},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {8332--8341},
  doi       = {10.1109/CVPR52733.2024.00796},
  url       = {https://mlanthology.org/cvpr/2024/argaw2024cvpr-scaling/}
}