Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

Huang, De-An; Buch, Shyamal; Dery, Lucio; Garg, Animesh; Fei-Fei, Li; Niebles, Juan Carlos

doi:10.1109/cvpr.2018.00623

CVPR 2018

Abstract

Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, this problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding for resolving ambiguous visual-linguistic cues. Furthermore, dense annotations and video data scale mean supervised approaches are prohibitively costly. In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependency between grounding and references in video. For optimization, we propose a new reference-aware multiple instance learning (RA-MIL) objective for weak supervision of grounding in videos. We evaluate our approach over unconstrained videos from YouCookII and RoboWatch, augmented with new reference-grounding test set annotations. We demonstrate that our jointly optimized, reference-aware approach simultaneously improves visual grounding, reference-resolution, and generalization to unseen instructional video categories.

Cite

Text

Huang et al. "Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. doi:10.1109/cvpr.2018.00623

Markdown

[Huang et al. "Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.](https://mlanthology.org/cvpr/2018/huang2018cvpr-finding/) doi:10.1109/cvpr.2018.00623

BibTeX

@inproceedings{huang2018cvpr-finding,
  title     = {{Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos}},
  author    = {Huang, De-An and Buch, Shyamal and Dery, Lucio and Garg, Animesh and Fei-Fei, Li and Niebles, Juan Carlos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2018},
  doi       = {10.1109/cvpr.2018.00623},
  url       = {https://mlanthology.org/cvpr/2018/huang2018cvpr-finding/}
}