Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

Abstract

In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experimental results show that our method outperforms separate learning methods and naive multi-task learning methods. Data and code will be released at https://github.com/shiyi-zh0408/NAE_CVPR2024.
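
The abstract does not spell out how prompts turn score regression into video-text matching, so the following is only a minimal sketch of one common way such a reformulation can work, not the authors' implementation. It assumes a set of hypothetical score-level prompts, each tied to a representative score, and toy two-dimensional embeddings standing in for real video and text encoders. The function name `score_from_matching`, the prompt wording, the bin scores, and the temperature are all illustrative assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_from_matching(video_emb, prompt_embs, bin_scores, temperature=0.1):
    """Predict a score by matching a video embedding against score prompts.

    Each prompt embedding represents one score level. The matching
    probabilities weight the levels' representative scores, so the
    regression target is recovered as an expectation.
    """
    logits = [cosine(video_emb, p) / temperature for p in prompt_embs]
    probs = softmax(logits)
    expected = sum(p * s for p, s in zip(probs, bin_scores))
    return expected, probs


# Hypothetical score levels and prompt texts (assumptions for illustration).
bin_scores = [20.0, 50.0, 80.0]
prompts = [f"an action whose score is around {s:.0f}" for s in bin_scores]

# Toy embeddings standing in for the outputs of real text and video encoders.
prompt_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_emb = [0.0, 2.0]

score, probs = score_from_matching(video_emb, prompt_embs, bin_scores)
best = max(range(len(probs)), key=probs.__getitem__)
print(prompts[best], round(score, 2))
```

In this toy setup the video embedding aligns with the middle prompt, so the matching distribution peaks there and the expected score lands at the middle level. A shared embedding space of this kind is what would let the scoring signal and the narrative text interact, which is the property the abstract attributes to the prompt-based formulation.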

Cite

Text

Zhang et al. "Narrative Action Evaluation with Prompt-Guided Multimodal Interaction." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01744

Markdown

[Zhang et al. "Narrative Action Evaluation with Prompt-Guided Multimodal Interaction." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhang2024cvpr-narrative/) doi:10.1109/CVPR52733.2024.01744

BibTeX

@inproceedings{zhang2024cvpr-narrative,
  title     = {{Narrative Action Evaluation with Prompt-Guided Multimodal Interaction}},
  author    = {Zhang, Shiyi and Bai, Sule and Chen, Guangyi and Chen, Lei and Lu, Jiwen and Wang, Junle and Tang, Yansong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18430-18439},
  doi       = {10.1109/CVPR52733.2024.01744},
  url       = {https://mlanthology.org/cvpr/2024/zhang2024cvpr-narrative/}
}