What Makes Training Multi-Modal Classification Networks Hard?
Abstract
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its uni-modal counterpart. In our experiments, however, we observe the opposite: the best uni-modal network can outperform the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks for video classification. This paper identifies two main causes of this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient-Blending, which computes an optimal blending of modalities based on their overfitting behaviors. We demonstrate that Gradient-Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
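The abstract does not spell out how the blending is computed, so the following is only a minimal sketch of one way the idea could look, under stated assumptions: each modality (plus the joint network) has its own loss, and each loss is weighted by how much validation loss improved between two checkpoints relative to how much the train/validation gap grew. The function names `blend_weights` and `blended_loss`, the checkpoint tuple layout, and the specific gain-over-squared-gap rule are illustrative assumptions, not the authors' published code.

```python
def blend_weights(stats, eps=1e-8):
    """Estimate one blending weight per branch from loss measurements.

    stats maps a branch name to four losses measured at two checkpoints:
    (train_before, val_before, train_after, val_after).
    The weighting rule is an assumption made for illustration.
    """
    raw = {}
    for name, (tr0, va0, tr1, va1) in stats.items():
        # Generalization gain: how much the validation loss dropped.
        gain = max(va0 - va1, 0.0)
        # Overfitting growth: how much the train/validation gap widened.
        overfit = (va1 - tr1) - (va0 - tr0)
        raw[name] = gain / (overfit ** 2 + eps)
    total = sum(raw.values())
    if total == 0.0:
        # No branch improved on validation data: fall back to equal weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def blended_loss(losses, weights):
    """Weighted sum of per-branch losses, used as the training objective."""
    return sum(weights[name] * losses[name] for name in losses)


if __name__ == "__main__":
    # Hypothetical measurements: audio generalizes well, video overfits faster.
    stats = {
        "audio": (2.0, 2.0, 1.0, 1.5),
        "video": (2.0, 2.0, 0.5, 1.8),
    }
    weights = blend_weights(stats)
    print(weights)
    print(blended_loss({"audio": 1.2, "video": 0.9}, weights))
```

In this sketch a branch that overfits quickly receives a small weight, so its gradients contribute less to the shared training signal, which matches the abstract's point that modalities overfit and generalize at different rates.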
Cite
Text
Wang et al. "What Makes Training Multi-Modal Classification Networks Hard?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01271
Markdown
[Wang et al. "What Makes Training Multi-Modal Classification Networks Hard?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/wang2020cvpr-makes/) doi:10.1109/CVPR42600.2020.01271
BibTeX
@inproceedings{wang2020cvpr-makes,
title = {{What Makes Training Multi-Modal Classification Networks Hard?}},
author = {Wang, Weiyao and Tran, Du and Feiszli, Matt},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2020},
doi = {10.1109/CVPR42600.2020.01271},
url = {https://mlanthology.org/cvpr/2020/wang2020cvpr-makes/}
}