Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Abstract

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps, such as the steps of a recipe or a DIY fix-it task, are performed in sequence across a long video to reach a final goal state. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
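To make the idea of a probabilistic task graph concrete, here is a minimal sketch; it is not the authors' implementation. It assumes keystep label sequences are already available for a set of videos, estimates transition probabilities by smoothed counting, and uses Viterbi decoding to regularize noisy per-segment keystep scores. The function names (`mine_task_graph`, `decode_with_graph`), the smoothing constant, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def mine_task_graph(sequences, num_keysteps, smoothing=1e-3):
    """Estimate keystep-to-keystep transition probabilities by counting.

    Assumption: `sequences` holds integer keystep labels, one list per video.
    """
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1.0
    # Normalize each row so outgoing edge weights sum to 1.
    return counts / counts.sum(axis=1, keepdims=True)


def decode_with_graph(segment_scores, transitions):
    """Viterbi decoding of a (T, K) score matrix under the transition graph."""
    log_scores = np.log(segment_scores + 1e-12)
    log_trans = np.log(transitions)
    num_segments, num_keysteps = log_scores.shape
    best = np.zeros((num_segments, num_keysteps))
    back = np.zeros((num_segments, num_keysteps), dtype=int)
    best[0] = log_scores[0]
    for t in range(1, num_segments):
        # cand[i, j]: best score ending in keystep i at t-1, then moving to j.
        cand = best[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + log_scores[t]
    # Trace the best path backwards from the final segment.
    path = [int(best[-1].argmax())]
    for t in range(num_segments - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy data (hypothetical): three videos over three keysteps.
graph = mine_task_graph([[0, 1, 2], [0, 1, 2], [0, 1, 1, 2]], num_keysteps=3)

# Per-segment scores where the middle segment slightly favors keystep 2.
scores = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.4, 0.5],
                   [0.1, 0.1, 0.8]])
independent = scores.argmax(axis=1).tolist()  # ignores the graph
path = decode_with_graph(scores, graph)       # uses the graph as a prior
print(independent, path)
```

In this toy case the per-segment argmax picks keystep 2 for the ambiguous middle segment, while the graph-regularized decoding prefers the 0, 1, 2 order seen in the mined sequences. The paper's graph construction and its use for zero-shot localization and representation learning are more involved than this counting-plus-Viterbi sketch.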

Cite

Text

Ashutosh et al. "Video-Mined Task Graphs for Keystep Recognition in Instructional Videos." Neural Information Processing Systems, 2023.

Markdown

[Ashutosh et al. "Video-Mined Task Graphs for Keystep Recognition in Instructional Videos." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/ashutosh2023neurips-videomined/)

BibTeX

@inproceedings{ashutosh2023neurips-videomined,
  title     = {{Video-Mined Task Graphs for Keystep Recognition in Instructional Videos}},
  author    = {Ashutosh, Kumar and Ramakrishnan, Santhosh Kumar and Afouras, Triantafyllos and Grauman, Kristen},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/ashutosh2023neurips-videomined/}
}