AMEGO: Active Memory from Long EGOcentric Videos

Abstract

Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and supports multiple queries without reprocessing the entire visual content. Additionally, to evaluate understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase the improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.

Cite

Text

Goletto et al. "AMEGO: Active Memory from Long EGOcentric Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72624-8_6

Markdown

[Goletto et al. "AMEGO: Active Memory from Long EGOcentric Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/goletto2024eccv-amego/) doi:10.1007/978-3-031-72624-8_6

BibTeX

@inproceedings{goletto2024eccv-amego,
  title     = {{AMEGO: Active Memory from Long EGOcentric Videos}},
  author    = {Goletto, Gabriele and Nagarajan, Tushar and Averta, Giuseppe and Damen, Dima},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72624-8_6},
  url       = {https://mlanthology.org/eccv/2024/goletto2024eccv-amego/}
}