Video-Audio Domain Generalization via Confounder Disentanglement
Abstract
Existing video-audio understanding models are trained and evaluated in an intra-domain setting, so their performance degrades in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlations to confounders that affect both video-audio features and labels. We propose DeVADG, a framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both the class level and the domain level using half-sibling regression and unpaired domain transformation, which essentially identifies the domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect VADG-Action, a dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, including multi-source DG, single-source DG, and qualitative analysis, which validate the rationality of our causal analysis and the effectiveness of the DeVADG framework.
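For readers unfamiliar with the back-door adjustment mentioned in the abstract, the standard textbook form is sketched below. The symbols are assumptions made for illustration and are not taken from the paper: X stands for the video-audio features, Y for the label, and Z for a set of confounders satisfying the back-door criterion. The exact formulation used in DeVADG may differ.

```latex
% Back-door adjustment (standard form; X, Y, Z are illustrative symbols,
% not notation confirmed by the paper).
% Intervening on X removes the confounding path X <- Z -> Y by averaging
% the conditional label distribution over the prior of the confounder.
\[
  P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z)
\]
% Contrast with the observational conditional, which weights each z by
% P(z | X) and therefore lets the confounder bias the prediction:
\[
  P(Y \mid X) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z \mid X)
\]
```

The only difference between the two expressions is the weight on each confounder value: P(Z = z) in the adjusted form versus P(Z = z | X) in the observational one. That is why estimating the confounders explicitly, as the abstract describes at the class and domain level, is a prerequisite for applying the adjustment.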
Cite
Text
Zhang et al. "Video-Audio Domain Generalization via Confounder Disentanglement." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I12.26787
Markdown
[Zhang et al. "Video-Audio Domain Generalization via Confounder Disentanglement." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/zhang2023aaai-video/) doi:10.1609/AAAI.V37I12.26787
BibTeX
@inproceedings{zhang2023aaai-video,
title = {{Video-Audio Domain Generalization via Confounder Disentanglement}},
author = {Zhang, Shengyu and Feng, Xusheng and Fan, Wenyan and Fang, Wenjing and Feng, Fuli and Ji, Wei and Li, Shuo and Wang, Li and Zhao, Shanshan and Zhao, Zhou and Chua, Tat-Seng and Wu, Fei},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2023},
pages = {15322-15330},
doi = {10.1609/AAAI.V37I12.26787},
url = {https://mlanthology.org/aaai/2023/zhang2023aaai-video/}
}