Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
Abstract
Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for the alignment noise, i.e., narrations irrelevant to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure the alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from the different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding, by 5.9%, 3.1%, and 2.8%, respectively.
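To make the pathway-fusion idea concrete, below is a minimal NumPy sketch of how three (steps x video-segments) alignment score matrices could be combined into a binary pseudo-matching. The embedding sources, the min-max normalization, the equal-weight averaging, and the threshold are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a (M, D) and b (N, D) -> (M, N)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def normalize(scores):
    """Min-max normalize a score matrix to [0, 1] so pathways are comparable
    (an assumption; other calibrations are possible)."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def narration_pathway(step_emb, narr_emb, narr_to_video):
    """Pathway (1): relate LLM-steps to narrations by text similarity, then
    propagate to video segments through narration timestamps.
    narr_to_video is a (N, T) 0/1 mask built from the narration timestamps."""
    return cosine_sim(step_emb, narr_emb) @ narr_to_video  # (S, T)

def fuse_pathways(p1, p2, p3, threshold=0.5):
    """Average the three (S, T) pathway score matrices and threshold into a
    binary pseudo step-to-segment matching used as training supervision."""
    fused = (normalize(p1) + normalize(p2) + normalize(p3)) / 3.0
    return (fused >= threshold).astype(np.float32)
```

In this sketch, pathways (2) and (3) would simply be `cosine_sim` between step embeddings and video features from a long-term and a short-term video-text model, respectively; the fused matrix then serves as the pseudo-label target for contrastive training.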
Cite
Text
Chen et al. "Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73007-8_12
Markdown
[Chen et al. "Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chen2024eccv-learning-c/) doi:10.1007/978-3-031-73007-8_12
BibTeX
@inproceedings{chen2024eccv-learning-c,
title = {{Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment}},
author = {Chen, Yuxiao and Li, Kai and Bao, Wentao and Patel, Deep and Kong, Yu and Min, Martin Renqiang and Metaxas, Dimitris N.},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73007-8_12},
url = {https://mlanthology.org/eccv/2024/chen2024eccv-learning-c/}
}