Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.

Cite

Text

Lee et al. "Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01382

Markdown

[Lee et al. "Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/lee2024cvpr-modeling/) doi:10.1109/CVPR52733.2024.01382

BibTeX

@inproceedings{lee2024cvpr-modeling,
  title     = {{Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations}},
  author    = {Lee, Sangmin and Lai, Bolin and Ryan, Fiona and Boote, Bikram and Rehg, James M.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14585--14595},
  doi       = {10.1109/CVPR52733.2024.01382},
  url       = {https://mlanthology.org/cvpr/2024/lee2024cvpr-modeling/}
}