Action Scene Graphs for Long-Form Understanding of Egocentric Videos

Abstract

We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs, which offer a rich set of annotations for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, action anticipation and activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and code to replicate experiments and annotations.
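
To make the abstract's description more concrete, below is a minimal, hypothetical sketch of what an EASG-like structure could look like: a temporally ordered sequence of per-action graphs linking the camera wearer, a verb, and interacted objects. The node types, relation labels, and example values are illustrative assumptions, not the released annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str   # e.g. "camera_wearer", "verb", or "object" (assumed types)
    label: str       # e.g. "take", "knife"

@dataclass
class Edge:
    source: str      # node_id of the source node
    target: str      # node_id of the target node
    relation: str    # e.g. "verb" or "dobj" (direct object) -- assumed relation labels

@dataclass
class ActionGraph:
    """One graph describing a single action performed by the camera wearer."""
    timestamp: float                              # time (seconds) in the video
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

# A video-level EASG would then be a temporally evolving sequence of action graphs.
easg: list[ActionGraph] = [
    ActionGraph(
        timestamp=12.4,
        nodes=[
            Node("cw", "camera_wearer", "camera wearer"),
            Node("v1", "verb", "take"),
            Node("o1", "object", "knife"),
        ],
        edges=[
            Edge("cw", "v1", "verb"),   # the camera wearer performs the action "take"
            Edge("v1", "o1", "dobj"),   # the knife is the object being taken
        ],
    ),
]
```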

Cite

Text

Rodin et al. "Action Scene Graphs for Long-Form Understanding of Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01762

Markdown

[Rodin et al. "Action Scene Graphs for Long-Form Understanding of Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/rodin2024cvpr-action/) doi:10.1109/CVPR52733.2024.01762

BibTeX

@inproceedings{rodin2024cvpr-action,
  title     = {{Action Scene Graphs for Long-Form Understanding of Egocentric Videos}},
  author    = {Rodin, Ivan and Furnari, Antonino and Min, Kyle and Tripathi, Subarna and Farinella, Giovanni Maria},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18622-18632},
  doi       = {10.1109/CVPR52733.2024.01762},
  url       = {https://mlanthology.org/cvpr/2024/rodin2024cvpr-action/}
}