Learning to Segment Referred Objects from Narrated Egocentric Videos

Abstract

Egocentric videos provide a first-person perspective of the wearer's activities, involving simultaneous interactions with multiple objects. In this work, we propose the task of weakly-supervised Narration-based Video Object Segmentation (NVOS). Given an egocentric video clip and a narration of the wearer's activities, our aim is to segment object instances mentioned in the narration, without using any spatial annotations during training. Existing weakly-supervised video object grounding methods typically yield bounding boxes for referred objects. In contrast, we propose ROSA, a weakly-supervised pixel-level grounding framework that learns alignments between referred objects and segmentation mask proposals. Our model harnesses vision-language models pre-trained on image-text pairs to embed region masks and object phrases. During training, we combine (a) a video-narration contrastive loss that implicitly supervises the alignment between regions and phrases, and (b) a region-phrase contrastive loss based on inferred latent alignments. To address the lack of annotated NVOS datasets in egocentric videos, we create a new evaluation benchmark, VISOR-NVOS, leveraging existing annotations of segmentation masks from VISOR alongside 14.6k newly-collected object-based video clip narrations. Our approach achieves state-of-the-art zero-shot pixel-level grounding performance compared to strong baselines under similar supervision. Additionally, we demonstrate generalization capabilities for zero-shot video object grounding on YouCook2, a third-person instructional video dataset.
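The abstract describes training with two contrastive objectives over region-mask and phrase embeddings. The PyTorch-style sketch below is only a rough, hypothetical illustration of how such losses could be combined, not the paper's formulation: the function names, the temperature value, the max-over-regions aggregation, the symmetric InfoNCE video-narration loss, and the hard pseudo-label region-phrase loss are all assumptions made for illustration.

import torch
import torch.nn.functional as F

def region_phrase_similarity(region_emb, phrase_emb, temperature=0.07):
    # region_emb: (R, D) embeddings of segmentation mask proposals
    # phrase_emb: (P, D) embeddings of object phrases from the narration
    # Returns a (P, R) cosine-similarity matrix scaled by a temperature.
    region_emb = F.normalize(region_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    return phrase_emb @ region_emb.t() / temperature

def video_narration_score(sim_pr):
    # Aggregate region-phrase similarities into one clip-narration score:
    # best-matching region per phrase (latent alignment), averaged over
    # phrases. The aggregation choice here is an assumption.
    return sim_pr.max(dim=1).values.mean()

def contrastive_losses(batch_regions, batch_phrases):
    # batch_regions / batch_phrases: lists of per-clip embedding tensors.
    # (a) video-narration contrastive loss over clip-narration scores;
    # (b) region-phrase contrastive loss using inferred (argmax) alignments.
    B = len(batch_regions)

    # (a) clip-level scores for all clip/narration pairs in the batch,
    #     trained with a symmetric InfoNCE-style objective (assumption).
    scores = torch.stack([
        torch.stack([
            video_narration_score(
                region_phrase_similarity(batch_regions[i], batch_phrases[j]))
            for j in range(B)])
        for i in range(B)])
    targets = torch.arange(B)
    loss_vn = 0.5 * (F.cross_entropy(scores, targets) +
                     F.cross_entropy(scores.t(), targets))

    # (b) within each clip, treat each phrase's best-matching region as a
    #     pseudo-positive and contrast it against the other regions.
    loss_rp = 0.0
    for regions, phrases in zip(batch_regions, batch_phrases):
        sim = region_phrase_similarity(regions, phrases)  # (P, R)
        pos = sim.argmax(dim=1)                           # latent alignment
        loss_rp = loss_rp + F.cross_entropy(sim, pos)
    loss_rp = loss_rp / B

    return loss_vn + loss_rp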

Cite

Text

Shen et al. "Learning to Segment Referred Objects from Narrated Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01375

Markdown

[Shen et al. "Learning to Segment Referred Objects from Narrated Egocentric Videos." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/shen2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01375

BibTeX

@inproceedings{shen2024cvpr-learning,
  title     = {{Learning to Segment Referred Objects from Narrated Egocentric Videos}},
  author    = {Shen, Yuhan and Wang, Huiyu and Yang, Xitong and Feiszli, Matt and Elhamifar, Ehsan and Torresani, Lorenzo and Mavroudi, Effrosyni},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14510--14520},
  doi       = {10.1109/CVPR52733.2024.01375},
  url       = {https://mlanthology.org/cvpr/2024/shen2024cvpr-learning/}
}