The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Abstract

In recent years, the thriving development of research on egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. See the project page: https://vjwq.github.io/AV-CONV/
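
The abstract describes self-attention applied across time, across subjects, and across modalities. The sketch below is only an illustration of that general idea, not the AV-CONV architecture: it assumes a hypothetical feature tensor of shape (T, S, M, D) for time steps, subjects, modalities, and feature dimension, and it uses plain scaled dot-product attention with identity query/key/value projections in place of any learned layers. The names `self_attention` and `attend_along` are made up for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Scaled dot-product self-attention over the second-to-last axis.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch minimal.
    tokens has shape (..., N, D); returns (output, weights).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., N, N)
    weights = softmax(scores, axis=-1)
    return weights @ tokens, weights


def attend_along(features, axis):
    """Apply self-attention along one axis of a (T, S, M, D) tensor."""
    moved = np.moveaxis(features, axis, -2)   # attended axis becomes N
    out, weights = self_attention(moved)
    return np.moveaxis(out, -2, axis), weights  # restore original layout


# Hypothetical sizes: 4 time steps, 3 subjects, 2 modalities, 8-dim features.
rng = np.random.default_rng(0)
T, S, M, D = 4, 3, 2, 8
features = rng.normal(size=(T, S, M, D))

across_time, w_time = attend_along(features, axis=0)
across_subjects, w_subj = attend_along(features, axis=1)
across_modalities, w_mod = attend_along(features, axis=2)
```

Each call keeps the (T, S, M, D) layout and mixes information only along the chosen axis; a real model would typically add learned projections, multiple heads, and prediction layers on top.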

Cite

Text

Jia et al. "The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02493

Markdown

[Jia et al. "The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/jia2024cvpr-audiovisual/) doi:10.1109/CVPR52733.2024.02493

BibTeX

@inproceedings{jia2024cvpr-audiovisual,
  title     = {{The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective}},
  author    = {Jia, Wenqi and Liu, Miao and Jiang, Hao and Ananthabhotla, Ishwarya and Rehg, James M. and Ithapu, Vamsi Krishna and Gao, Ruohan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {26396-26405},
  doi       = {10.1109/CVPR52733.2024.02493},
  url       = {https://mlanthology.org/cvpr/2024/jia2024cvpr-audiovisual/}
}