United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Abstract

Given multiple videos of the same task, procedure learning aims to identify the key-steps and determine the order in which they are performed. Existing approaches rely on the signal generated from a pair of videos, which makes key-step discovery challenging because the algorithms lack an inter-video perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL is built on the novel UnityGraph, which represents all the videos of a task as a graph to capture both intra-video and inter-video context. To obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on the benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.
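
To make the graph idea above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how UnityGraph is built, so the sketch assumes that nodes are clips, that intra-video context comes from edges between consecutive clips, and that inter-video context comes from edges to the most similar clip in each other video. The names `build_graph`, `random_walks`, and `cosine` are hypothetical. The walks are uniform random walks, which correspond to Node2Vec only in the special case p = q = 1; learning embeddings from the walks with skip-gram and clustering them with KMeans is left out.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(videos, k=1):
    """Build an undirected graph over all clips of all videos.

    videos: list of videos, each a list of per-clip feature vectors.
    Nodes are (video_index, clip_index) pairs.
    ASSUMPTION: the edge rules below are illustrative, not taken from the paper.
    """
    graph = {
        (v, t): set()
        for v, clips in enumerate(videos)
        for t in range(len(clips))
    }
    # Intra-video context: connect temporally adjacent clips.
    for v, clips in enumerate(videos):
        for t in range(len(clips) - 1):
            graph[(v, t)].add((v, t + 1))
            graph[(v, t + 1)].add((v, t))
    # Inter-video context: connect each clip to its k most similar
    # clips in every other video.
    for v, clips in enumerate(videos):
        for t, feat in enumerate(clips):
            for u, other in enumerate(videos):
                if u == v:
                    continue
                ranked = sorted(
                    range(len(other)),
                    key=lambda s: cosine(feat, other[s]),
                    reverse=True,
                )
                for s in ranked[:k]:
                    graph[(v, t)].add((u, s))
                    graph[(u, s)].add((v, t))
    return graph


def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Uniform random walks from every node (Node2Vec with p = q = 1)."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Because the walks can cross from one video to another through the similarity edges, clips showing the same key-step in different videos tend to appear in the same walk contexts, which is what would push their learned embeddings together before clustering.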

Cite

Text

Bansal et al. "United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Bansal et al. "United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/bansal2024wacv-united/)

BibTeX

@inproceedings{bansal2024wacv-united,
  title     = {{United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos}},
  author    = {Bansal, Siddhant and Arora, Chetan and Jawahar, C. V.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {6509-6519},
  url       = {https://mlanthology.org/wacv/2024/bansal2024wacv-united/}
}