Relational Space-Time Query in Long-Form Videos
Abstract
Egocentric videos are often available in the form of uninterrupted, uncurated long videos capturing the camera wearers' daily life activities. Understanding these videos requires models to be able to reason about activities, objects, and their interactions. However, current video benchmarks study these problems independently and under short, curated clips. In contrast, real-world applications, e.g., AR assistants, require bundling these problems for both model development and evaluation. In this paper, we propose to study these problems in a joint framework for long video understanding. Our contributions are three-fold. First, we propose an integrated framework, namely Relational Space-Time Query (ReST), for evaluating video understanding models via templated spatiotemporal queries. Second, we introduce two new benchmarks, ReST-ADL and ReST-Ego4D, which augment existing egocentric video datasets with abundant query annotations generated by the ReST framework. Finally, we present a set of baselines and in-depth analysis on the two benchmarks and provide insights about the query tasks. We view our integrated framework and benchmarks as a step towards comprehensive, multi-step reasoning in long videos, and believe they will facilitate the development of the next generation of video understanding models.
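To make the idea of a templated spatiotemporal query more concrete, here is a minimal, hypothetical sketch of what such a query and its grounded answer might look like. The field names (`verb`, `noun`, `time_window`, the box format) are illustrative assumptions for this sketch only, not the actual ReST annotation schema defined in the paper.

```python
# Hypothetical sketch of a templated spatiotemporal query and its answer.
# All field names and the query template are assumptions for illustration,
# not the official ReST schema.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BoundingBox:
    """Spatial grounding of an object in a single frame."""
    frame_index: int
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class SpaceTimeQuery:
    """A templated query over a long video: an activity (verb) applied to an
    object (noun), optionally restricted to a time window."""
    video_id: str
    verb: str                                           # e.g., "take"
    noun: str                                           # e.g., "mug"
    time_window: Optional[Tuple[float, float]] = None   # (start_sec, end_sec)

    def to_text(self) -> str:
        window = (
            f" between {self.time_window[0]:.1f}s and {self.time_window[1]:.1f}s"
            if self.time_window else ""
        )
        return (f"When and where did the camera wearer "
                f"{self.verb} the {self.noun}{window}?")


@dataclass
class SpaceTimeAnswer:
    """Grounded answer: a temporal segment plus per-frame bounding boxes."""
    start_sec: float
    end_sec: float
    boxes: List[BoundingBox] = field(default_factory=list)


if __name__ == "__main__":
    q = SpaceTimeQuery(video_id="ego_clip_0001", verb="take", noun="mug",
                       time_window=(120.0, 300.0))
    print(q.to_text())
```

A model evaluated under such a framework would be expected to return both the temporal extent and the spatial localization of the queried interaction, rather than a single clip-level label.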
Cite
Text
Yang et al. "Relational Space-Time Query in Long-Form Videos." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00619
Markdown
[Yang et al. "Relational Space-Time Query in Long-Form Videos." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yang2023cvpr-relational/) doi:10.1109/CVPR52729.2023.00619
BibTeX
@inproceedings{yang2023cvpr-relational,
title = {{Relational Space-Time Query in Long-Form Videos}},
author = {Yang, Xitong and Chu, Fu-Jen and Feiszli, Matt and Goyal, Raghav and Torresani, Lorenzo and Tran, Du},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {6398-6408},
doi = {10.1109/CVPR52729.2023.00619},
url = {https://mlanthology.org/cvpr/2023/yang2023cvpr-relational/}
}