Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Abstract
Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
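To make the head-shuffling idea concrete, below is a minimal, hypothetical PyTorch sketch of multi-head attention whose per-head outputs are randomly permuted before the output projection. It is not the authors' released code: the class name, dimensions, and the choice to shuffle only during training are assumptions made purely for illustration.

# Hypothetical sketch (not the authors' implementation): multi-head attention
# whose head outputs are shuffled along the head axis as a regularizer.
import torch
import torch.nn as nn

class ShuffledMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each projection to (batch, heads, time, d_head)
        q, k, v = (z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (batch, heads, time, d_head)
        if self.training:
            # shuffle which head feeds which slice of the output projection
            perm = torch.randperm(self.num_heads, device=x.device)
            heads = heads[:, perm]
        return self.out(heads.transpose(1, 2).reshape(b, t, -1))

# usage: y = ShuffledMultiHeadAttention()(torch.randn(2, 10, 512))

One plausible reading of the regularization claim is that randomly re-wiring which head feeds which slice of the output projection discourages individual heads from co-adapting, somewhat analogous to dropout applied across heads.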
Cite
Text
Geng et al. "Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I2.16231Markdown
[Geng et al. "Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/geng2021aaai-dynamic/) doi:10.1609/AAAI.V35I2.16231BibTeX
@inproceedings{geng2021aaai-dynamic,
title = {{Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers}},
author = {Geng, Shijie and Gao, Peng and Chatterjee, Moitreya and Hori, Chiori and Le Roux, Jonathan and Zhang, Yongfeng and Li, Hongsheng and Cherian, Anoop},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2021},
pages = {1415-1423},
doi = {10.1609/AAAI.V35I2.16231},
url = {https://mlanthology.org/aaai/2021/geng2021aaai-dynamic/}
}