VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
Abstract
Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions that combine text and video to assist users remains under-explored. To address this gap, we propose Visually Grounded Text-Video Prompting (VG-TVP), a novel LLM-empowered Multimodal Procedural Planning (MPP) framework that generates cohesive text and video procedural plans for a given high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of video captioning models, and the text-to-video generation ability of diffusion models. It improves the interaction between modalities through a novel Fusion of Captioning (FoC) method and two bridges, the Text-to-Video Bridge (T2V-B) and the Video-to-Text Bridge (V2T-B), which allow LLMs to guide the generation of visually grounded text plans and textually grounded video plans. To address the scarcity of datasets suitable for MPP, we curate a new dataset, Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences regarding textual and visual informativeness, temporal coherence, and plan accuracy. VG-TVP outperforms unimodal baselines on the Daily-PP dataset.
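The abstract describes a pipeline in which an LLM, a text-to-video diffusion model, and a video captioning model exchange information through the T2V-B and V2T-B bridges, with FoC feeding captions back into the LLM. The sketch below is a minimal, hypothetical rendering of that loop in Python; every name (`vg_tvp`, `ProceduralPlan`, the `llm`/`text_to_video`/`video_captioner` callables) and the exact ordering of the bridging steps are assumptions for illustration, not the authors' published implementation.

```python
# Hypothetical sketch of the VG-TVP loop described in the abstract.
# All names are illustrative placeholders, not the authors' API.

from dataclasses import dataclass, field


@dataclass
class ProceduralPlan:
    steps: list[str] = field(default_factory=list)   # text plan, one entry per step
    videos: list[str] = field(default_factory=list)  # generated video per step


def vg_tvp(task: str, llm, text_to_video, video_captioner) -> ProceduralPlan:
    """Generate a multimodal plan: text steps grounded in generated videos.

    `llm`, `text_to_video`, and `video_captioner` stand in for an LLM,
    a text-to-video diffusion model, and a video captioning model.
    """
    plan = ProceduralPlan()
    # Zero-shot LLM reasoning drafts an initial text plan.
    draft_steps = llm(f"List the steps to: {task}").splitlines()

    for step in draft_steps:
        # Text-to-Video Bridge (T2V-B): rewrite the step as a video
        # prompt and render a textually grounded step video.
        video = text_to_video(llm(f"Rewrite as a video prompt: {step}"))
        # Video-to-Text Bridge (V2T-B): caption the rendered video to
        # recover what was actually shown.
        caption = video_captioner(video)
        # Fusion of Captioning (FoC): let the LLM revise the step using
        # the caption, visually grounding the text plan.
        grounded = llm(f"Revise step '{step}' given the video shows: {caption}")
        plan.steps.append(grounded)
        plan.videos.append(video)
    return plan
```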
Cite
Text
Ilaslan et al. "VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32406
Markdown
[Ilaslan et al. "VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/ilaslan2025aaai-vg/) doi:10.1609/AAAI.V39I4.32406
BibTeX
@inproceedings{ilaslan2025aaai-vg,
title = {{VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting}},
author = {Ilaslan, Muhammet Furkan and Köksal, Ali and Lin, Kevin Qinghong and Satar, Burak and Shou, Mike Zheng and Xu, Qianli},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {3886--3894},
doi = {10.1609/AAAI.V39I4.32406},
url = {https://mlanthology.org/aaai/2025/ilaslan2025aaai-vg/}
}