Leveraging Weighted Cross-Graph Attention for Visual and Semantic Enhanced Video Captioning Network

Abstract

Video captioning has become a broad and interesting research area. Attention-based encoder-decoder methods are extensively used for caption generation. However, these methods mostly use visual attentive features to highlight video regions while overlooking the semantic features of the available captions. These semantic features contain significant information that helps to generate highly informative captions resembling human descriptions. Therefore, we propose a novel visual and semantic enhanced video captioning network, named VSVCap, that efficiently utilizes multiple ground truth captions. We aim to generate captions that are visually and semantically enhanced by exploiting both video and text modalities. To achieve this, we propose a fine-grained cross-graph attention mechanism that captures detailed graph embedding correspondence between visual graphs and textual knowledge graphs. We perform node-level matching and structure-level reasoning between the weighted regional graph and the knowledge graph. The proposed network achieves promising results on three benchmark datasets, i.e., YouTube2Text, MSR-VTT, and VATEX. The experimental results show that our network accurately captures the key objects, relationships, and semantically enhanced events of a video to generate captions that resemble human annotations.

Cite

Text

Verma et al. "Leveraging Weighted Cross-Graph Attention for Visual and Semantic Enhanced Video Captioning Network." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I2.25343

Markdown

[Verma et al. "Leveraging Weighted Cross-Graph Attention for Visual and Semantic Enhanced Video Captioning Network." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/verma2023aaai-leveraging/) doi:10.1609/AAAI.V37I2.25343

BibTeX

@inproceedings{verma2023aaai-leveraging,
  title     = {{Leveraging Weighted Cross-Graph Attention for Visual and Semantic Enhanced Video Captioning Network}},
  author    = {Verma, Deepali and Haldar, Arya and Dutta, Tanima},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {2465-2473},
  doi       = {10.1609/AAAI.V37I2.25343},
  url       = {https://mlanthology.org/aaai/2023/verma2023aaai-leveraging/}
}