Learning to Ground Instructional Articles in Videos Through Narrations

Abstract

In this paper, we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, and ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels that are iteratively refined and aggressively filtered. In order to validate our model, we introduce a new benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles. Experiments on this benchmark, as well as zero-shot evaluations on CrossTask, demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner narration-to-video matching module outperforms the state of the art by a large margin on the HTM-Align narration-video alignment benchmark.
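The indirect pathway described in the abstract composes steps-to-narrations with narrations-to-video correspondences. The following is a minimal sketch of that composition idea using chained row-stochastic similarity matrices; all variable names and the averaging fusion are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical raw similarity scores (e.g., from text/video encoders):
# steps vs. narrations: (num_steps, num_narrations)
# narrations vs. frames: (num_narrations, num_frames)
steps_vs_narrations = rng.normal(size=(4, 6))
narrations_vs_frames = rng.normal(size=(6, 10))

# Indirect pathway: compose step->narration attention with
# narration->frame alignment via matrix multiplication.
indirect = softmax(steps_vs_narrations, axis=1) @ softmax(narrations_vs_frames, axis=1)

# Direct pathway: step->frame similarity (random here for illustration).
direct = softmax(rng.normal(size=(4, 10)), axis=1)

# Fuse the two pathways; simple averaging is one plausible choice.
fused = 0.5 * (direct + indirect)
```

Because both pathways are row-stochastic, each fused row remains a distribution over frames, which can then be decoded into a temporal grounding for each step.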

Cite

Text

Mavroudi et al. "Learning to Ground Instructional Articles in Videos Through Narrations." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01395

Markdown

[Mavroudi et al. "Learning to Ground Instructional Articles in Videos Through Narrations." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/mavroudi2023iccv-learning/) doi:10.1109/ICCV51070.2023.01395

BibTeX

@inproceedings{mavroudi2023iccv-learning,
  title     = {{Learning to Ground Instructional Articles in Videos Through Narrations}},
  author    = {Mavroudi, Effrosyni and Afouras, Triantafyllos and Torresani, Lorenzo},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {15201--15213},
  doi       = {10.1109/ICCV51070.2023.01395},
  url       = {https://mlanthology.org/iccv/2023/mavroudi2023iccv-learning/}
}