Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data

Abstract

Motivated by depression's significant impact on global health, this work proposes MultiDepNet, a novel multimodal, interpretable depression detection system integrating visual, physiological, audio, and textual data. Through dedicated feature extraction methods (MTCNN for video, TS-CAN for physiological signals, ResNet-18 for audio, and RoBERTa for text) and a strategic fusion of modality-specific networks, including CNN-RNN, Transformer, MLP, and ResNet-18, it achieves significant advancements in depression detection. Its performance, evaluated across four benchmark datasets (AVEC 2013, AVEC 2014, DAIC, and E-DAIC), demonstrates an average MAE of 5.64, RMSE of 7.15, accuracy of 74.19%, precision of 0.7373, recall of 0.7378, and F1 score of 0.7376. It also implements a MultiViz-based interpretability mechanism that computes each modality's contribution to the model's performance. The results reveal the visual modality to be the most significant, contributing 37.88% towards depression detection.
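To make the fusion idea concrete, below is a minimal PyTorch sketch of a late-fusion head over pre-extracted per-modality features (visual, physiological, audio, text). The feature dimensions, layer sizes, and the class name LateFusionDepressionNet are illustrative assumptions, not the paper's actual MultiDepNet configuration or code.

import torch
import torch.nn as nn

class LateFusionDepressionNet(nn.Module):
    """Minimal late-fusion sketch: one small projection head per modality,
    concatenated and passed through a fusion MLP. All sizes are assumed."""

    def __init__(self, dims=(512, 128, 512, 768), hidden=256, num_classes=2):
        super().__init__()
        # One projection head per modality: visual, physiological, audio, text.
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims]
        )
        # Fusion MLP over the concatenated per-modality embeddings.
        self.fusion = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, visual, physio, audio, text):
        feats = [head(x) for head, x in zip(self.heads, (visual, physio, audio, text))]
        return self.fusion(torch.cat(feats, dim=-1))

if __name__ == "__main__":
    model = LateFusionDepressionNet()
    batch = [torch.randn(4, d) for d in (512, 128, 512, 768)]
    print(model(*batch).shape)  # torch.Size([4, 2])

In this sketch the per-modality encoders (MTCNN, TS-CAN, ResNet-18, RoBERTa) are assumed to have already produced fixed-size feature vectors; only the fusion stage is shown.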

Cite

Text

Kumar et al. "Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Kumar et al. "Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/kumar2025wacv-multimodal/)

BibTeX

@inproceedings{kumar2025wacv-multimodal,
  title     = {{Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data}},
  author    = {Kumar, Puneet and Misra, Shreshtha and Shao, Zhuhong and Zhu, Bin and Raman, Balasubramanian and Li, Xiaobai},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {5305-5315},
  url       = {https://mlanthology.org/wacv/2025/kumar2025wacv-multimodal/}
}