RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Abstract

Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works hold the unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (e.g., timestamps) or sequence-level labels (e.g., action categories) is demanding and labor-intensive, limiting generalization to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relations, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle the high annotation cost, RAP adopts a weakly-supervised learning approach that expands the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on the CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
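The two mechanisms the abstract highlights can be sketched in a few lines: an auto-regressive loop that emits steps until a special end token (so the plan length is not fixed in advance), and an external memory of state-action pairs queried by nearest-neighbor similarity to revise each generated step. The sketch below is a hypothetical toy illustration under those assumptions, not the authors' implementation; the names `RetrievalMemory`, `plan`, and the toy state transition are invented for exposition.

```python
# Toy sketch of two ideas from RAP (hypothetical, not the paper's code):
# (1) auto-regressive decoding that stops at an end-of-sequence token,
#     so the plan length is decided adaptively, and
# (2) an external memory of (state, action) pairs from training videos,
#     queried by cosine similarity to revise each proposed step.

EOS = "<eos>"  # special token marking the end of the procedure


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


class RetrievalMemory:
    """External memory of (state embedding, action label) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs  # list of (state_vector, action)

    def revise(self, state_vec, proposed_action):
        # Retrieve the action paired with the most similar stored state;
        # a crude stand-in for the paper's retrieval-based plan revision.
        best_state, best_action = max(
            self.pairs, key=lambda p: cosine(p[0], state_vec)
        )
        return best_action


def plan(step_model, memory, start_state, max_steps=10):
    """Auto-regressively generate a variable-length action plan."""
    state, actions = start_state, []
    for _ in range(max_steps):
        action = step_model(state, actions)    # propose the next step
        if action == EOS:                      # model decides to stop
            break
        action = memory.revise(state, action)  # retrieval-based revision
        actions.append(action)
        state = list(reversed(state))          # toy state transition
    return actions
```

A usage example with a hand-built memory and a trivial step model: the planner emits steps one at a time, revises each against the memory, and terminates as soon as the model proposes `EOS`, so different inputs can yield plans of different lengths.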

Cite

Text

Zare et al. "RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72980-5_24

Markdown

[Zare et al. "RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zare2024eccv-rap/) doi:10.1007/978-3-031-72980-5_24

BibTeX

@inproceedings{zare2024eccv-rap,
  title     = {{RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos}},
  author    = {Zare, Ali and Niu, Yulei and Ayyubi, Hammad and Chang, Shih-Fu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72980-5_24},
  url       = {https://mlanthology.org/eccv/2024/zare2024eccv-rap/}
}