Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-Form Video Understanding
Abstract
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically-consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically-consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
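The segmentation idea behind the abstract can be illustrated with a minimal sketch. It assumes per-frame feature vectors are already available, uses a linear kernel, and takes the number of segments as a fixed input; the function names `kts_change_points` and `segment_midpoints` are hypothetical and not taken from the paper. It is a generic dynamic program that minimizes within-segment kernel variance, not the authors' implementation, and it omits any automatic selection of the segment count.

```python
import numpy as np


def kts_change_points(features, num_segments):
    """Split frames into contiguous segments with minimal within-segment
    kernel variance (linear kernel). Returns boundaries [0, ..., n]."""
    n = len(features)
    gram = features @ features.T  # linear-kernel Gram matrix (assumption)

    # Prefix sums so that any segment cost is available in O(1).
    diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(gram))))
    block_cum = np.zeros((n + 1, n + 1))
    block_cum[1:, 1:] = gram.cumsum(axis=0).cumsum(axis=1)

    def cost(i, j):
        # Scatter of frames i..j-1 around their mean in feature space.
        block = (block_cum[j, j] - block_cum[i, j]
                 - block_cum[j, i] + block_cum[i, i])
        return diag_cum[j] - diag_cum[i] - block / (j - i)

    # best[k, j]: minimal cost of covering frames 0..j-1 with k segments.
    best = np.full((num_segments + 1, n + 1), np.inf)
    back = np.zeros((num_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1, i] + cost(i, j)
                if candidate < best[k, j]:
                    best[k, j] = candidate
                    back[k, j] = i

    # Walk the back-pointers from the last frame to recover boundaries.
    bounds = [n]
    for k in range(num_segments, 0, -1):
        bounds.append(int(back[k, bounds[-1]]))
    return bounds[::-1]


def segment_midpoints(bounds):
    """Pick one representative frame index per segment (its midpoint)."""
    return [(start + end) // 2 for start, end in zip(bounds[:-1], bounds[1:])]
```

For example, eight frames whose features switch abruptly after the fifth frame are split at that switch, and one frame is then sampled from each segment instead of at a fixed stride:

```python
feats = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 3)
bounds = kts_change_points(feats, 2)
samples = segment_midpoints(bounds)
```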
Cite
Text
Afham et al. "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-Form Video Understanding." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00128
Markdown
[Afham et al. "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-Form Video Understanding." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/afham2023iccvw-revisiting/) doi:10.1109/ICCVW60793.2023.00128
BibTeX
@inproceedings{afham2023iccvw-revisiting,
title = {{Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-Form Video Understanding}},
author = {Afham, Mohamed and Shukla, Satya Narayan and Poursaeed, Omid and Zhang, Pengchuan and Shah, Ashish and Lim, Sernam},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2023},
pages = {1181-1186},
doi = {10.1109/ICCVW60793.2023.00128},
url = {https://mlanthology.org/iccvw/2023/afham2023iccvw-revisiting/}
}