Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Abstract
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs, which offer a rich set of annotations for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, action anticipation and activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and code to replicate experiments and annotations.
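The sketch below is a minimal, hypothetical illustration of what a single EASG annotation could look like, inferred only from the abstract's description (a verb node for the camera wearer's action, object nodes, and labeled relationships over a time segment). All class and field names are assumptions for illustration, not the released annotation schema.

```python
# Hypothetical sketch of an Egocentric Action Scene Graph (EASG) annotation.
# Structure and field names are assumptions based on the abstract, not the
# actual dataset format.
from dataclasses import dataclass, field


@dataclass
class EASGNode:
    node_id: str
    label: str       # e.g. "take" (verb) or "knife" (object)
    node_type: str   # "verb" or "object"


@dataclass
class EASGEdge:
    source: str      # node_id of the source node
    target: str      # node_id of the target node
    relation: str    # e.g. "direct_object", "with"


@dataclass
class EASG:
    segment_start: float                          # action start time (seconds)
    segment_end: float                            # action end time (seconds)
    nodes: list[EASGNode] = field(default_factory=list)
    edges: list[EASGEdge] = field(default_factory=list)


# Example: the camera wearer takes a knife with the right hand.
graph = EASG(
    segment_start=12.3,
    segment_end=14.1,
    nodes=[
        EASGNode("n0", "take", "verb"),
        EASGNode("n1", "knife", "object"),
        EASGNode("n2", "right hand", "object"),
    ],
    edges=[
        EASGEdge("n0", "n1", "direct_object"),
        EASGEdge("n0", "n2", "with"),
    ],
)
```

A sequence of such graphs over consecutive action segments would then form the temporally evolving description the abstract refers to.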
Cite
Text
Rodin et al. "Action Scene Graphs for Long-Form Understanding of Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01762
Markdown
[Rodin et al. "Action Scene Graphs for Long-Form Understanding of Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/rodin2024cvpr-action/) doi:10.1109/CVPR52733.2024.01762
BibTeX
@inproceedings{rodin2024cvpr-action,
title = {{Action Scene Graphs for Long-Form Understanding of Egocentric Videos}},
author = {Rodin, Ivan and Furnari, Antonino and Min, Kyle and Tripathi, Subarna and Farinella, Giovanni Maria},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {18622--18632},
doi = {10.1109/CVPR52733.2024.01762},
url = {https://mlanthology.org/cvpr/2024/rodin2024cvpr-action/}
}