The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
Abstract
In recent years, the thriving development of research on egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Project page: https://vjwq.github.io/AV-CONV/
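As a rough illustration of what attending "across time, across subjects, and across modalities" can mean, the sketch below flattens one feature token per (time step, subject, modality) triple into a single sequence and applies plain scaled dot-product self-attention over it. This is not the AV-CONV architecture: the sizes `T`, `S`, `M`, `D`, the random features, the identity query/key/value projections, and the single joint attention pass are all assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j. A real model would use learned projections instead.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical sizes: 3 time steps, 2 subjects, 2 modalities (audio, visual),
# and 4-dimensional features. None of these come from the paper.
T, S, M, D = 3, 2, 2, 4
rng = random.Random(0)

# One token per (time, subject, modality) triple, flattened into one sequence,
# so each token can attend to every other time step, subject, and modality.
index = [(t, s, m) for t in range(T) for s in range(S) for m in range(M)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in index]

outputs, weights = self_attention(tokens)
```

Because all tokens share one sequence, `weights[i][j]` can be read back through `index` to see which time step, subject, and modality token `i` draws on.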
Cite
Text
Jia et al. "The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02493
Markdown
[Jia et al. "The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/jia2024cvpr-audiovisual/) doi:10.1109/CVPR52733.2024.02493
BibTeX
@inproceedings{jia2024cvpr-audiovisual,
title = {{The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective}},
author = {Jia, Wenqi and Liu, Miao and Jiang, Hao and Ananthabhotla, Ishwarya and Rehg, James M. and Ithapu, Vamsi Krishna and Gao, Ruohan},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26396-26405},
doi = {10.1109/CVPR52733.2024.02493},
url = {https://mlanthology.org/cvpr/2024/jia2024cvpr-audiovisual/}
}