Few-Shot Referring Relationships in Videos

Abstract

Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as <subject, predicate, object> and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support-set videos sharing that predicate. We address this challenging problem, referred to as few-shot referring relationships in videos, for the first time. To this end, we pose the problem as the minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embeddings as input and is meta-trained on support-set videos in an episodic manner. Further, the objective function is minimized using belief propagation-based message passing on the random field to obtain the spatiotemporal localization, i.e., the subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.
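The inference step described above amounts to decoding the best sequence of candidate boxes across the T frames of a chain-structured random field. The sketch below is illustrative only (not the authors' code): it assumes the frame-level potentials `unary[t, k]` (e.g., produced by a meta-trained relation network scoring candidate k in frame t against the query) and the inter-frame similarity potentials `pairwise[t, j, k]` are already computed, and runs max-product message passing (Viterbi-style dynamic programming) to recover a trajectory. All names and shapes are assumptions for illustration.

```python
# Minimal sketch of max-product message passing on a T-frame chain.
# unary[t, k]: assumed frame-level score of candidate k in frame t.
# pairwise[t, j, k]: assumed similarity between candidate j (frame t)
#                    and candidate k (frame t+1).
import numpy as np


def decode_trajectory(unary: np.ndarray, pairwise: np.ndarray) -> list[int]:
    """Return the highest-scoring candidate index per frame."""
    T, K = unary.shape
    score = unary[0].copy()                  # best path score ending at frame 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[j, k]: best path through candidate j at frame t-1 into k at frame t
        cand = score[:, None] + pairwise[t - 1]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    # Backtrack from the best final candidate to recover the full trajectory.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 5, 4                              # 5 frames, 4 candidate pairs each
    unary = rng.random((T, K))               # stand-in for relation-network scores
    pairwise = rng.random((T - 1, K, K))     # stand-in for inter-frame similarities
    print(decode_trajectory(unary, pairwise))
```

Because the field is chain-structured over frames, this exact decoding runs in O(T·K²) time; the paper's potentials would simply replace the random stand-ins here.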

Cite

Text

Kumar and Mishra. "Few-Shot Referring Relationships in Videos." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00227

Markdown

[Kumar and Mishra. "Few-Shot Referring Relationships in Videos." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/kumar2023cvpr-fewshot/) doi:10.1109/CVPR52729.2023.00227

BibTeX

@inproceedings{kumar2023cvpr-fewshot,
  title     = {{Few-Shot Referring Relationships in Videos}},
  author    = {Kumar, Yogesh and Mishra, Anand},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {2289-2298},
  doi       = {10.1109/CVPR52729.2023.00227},
  url       = {https://mlanthology.org/cvpr/2023/kumar2023cvpr-fewshot/}
}