Retrieval-Augmented Egocentric Video Captioning

Abstract

Understanding human actions from first-person videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper: (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos; (2) to train the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets; (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions; (4) through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. On egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
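The idea behind point (3), aligning egocentric and exocentric video features to shared text features, can be illustrated with a generic multi-positive InfoNCE-style objective. The sketch below is an assumption-laden illustration, not the paper's EgoExoNCE formulation: the function names, the `positive_mask` convention, and the temperature value are all hypothetical and chosen only for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def multi_positive_nce(video_feats, text_feats, positive_mask, temperature=0.07):
    """Generic video-to-text contrastive loss with possibly several positives per video.

    video_feats:   (N, D) features of ego and exo clips stacked together (assumed layout).
    text_feats:    (M, D) features of action descriptions.
    positive_mask: (N, M) with 1 where a text describes the clip's action, else 0.
                   Every row must contain at least one positive.
    """
    v = l2_normalize(video_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    pos = (exp * positive_mask).sum(axis=1)
    total = exp.sum(axis=1)
    return float(-np.log(pos / total).mean())

# Toy setup: two actions, each with one ego clip and one exo clip.
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)  # ego0, ego1, exo0, exo1
aligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
misaligned = aligned[:, ::-1]  # each clip sits on the wrong action's text feature

loss_aligned = multi_positive_nce(aligned, texts, mask)
loss_misaligned = multi_positive_nce(misaligned, texts, mask)
```

Because an ego clip and an exo clip of the same action share the same positive text, minimizing a loss of this kind moves both views toward a common point in the embedding space, which is the intuition the abstract describes.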

Cite

Text

Xu et al. "Retrieval-Augmented Egocentric Video Captioning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01284

Markdown

[Xu et al. "Retrieval-Augmented Egocentric Video Captioning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/xu2024cvpr-retrievalaugmented/) doi:10.1109/CVPR52733.2024.01284

BibTeX

@inproceedings{xu2024cvpr-retrievalaugmented,
  title     = {{Retrieval-Augmented Egocentric Video Captioning}},
  author    = {Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13525-13536},
  doi       = {10.1109/CVPR52733.2024.01284},
  url       = {https://mlanthology.org/cvpr/2024/xu2024cvpr-retrievalaugmented/}
}