Why Only Text: Empowering Vision-and-Language Navigation with Multi-Modal Prompts

Hong, Haodong; Wang, Sen; Huang, Zi; Wu, Qi; Liu, Jiajun

doi:10.24963/ijcai.2024/93

IJCAI 2024 pp. 839-847

Abstract

Deep neural networks (DNNs) face substantial challenges in Long-Tail Visual Recognition (LTVR) due to the inherent class imbalances in real-world data distributions. The Mixture of Experts (MoE) framework has emerged as a promising approach to addressing these issues. However, in MoE systems, experts are typically trained to optimize a collective objective, often neglecting the individual optimality of each expert. This individual optimality usually contributes to the overall performance, as the goals of different experts are not mutually exclusive. We propose the Independent and Collaborative Learning (ICL) framework to optimize each expert independently while ensuring global optimality. First, Diverse Optimization Learning (DOL) is introduced to enhance expert diversity and individual performance. Then, we conceptualize experts as parallel circuit branches and introduce Competition and Collaboration Learning (CoL). Competition Learning amplifies the gradients of better-performing experts to preserve individual optimality, and Collaboration Learning encourages collaboration through mutual distillation to enhance optimal knowledge sharing. ICL achieves state-of-the-art accuracy in experiments on CIFAR-100/10-LT, ImageNet-LT, and iNaturalist 2018. Our code is available at https://github.com/PolarisLight/ICL.

Cite

Text

Hong et al. "Why Only Text: Empowering Vision-and-Language Navigation with Multi-Modal Prompts." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/93

Markdown

[Hong et al. "Why Only Text: Empowering Vision-and-Language Navigation with Multi-Modal Prompts." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/hong2024ijcai-only/) doi:10.24963/ijcai.2024/93

BibTeX

@inproceedings{hong2024ijcai-only,
  title     = {{Why Only Text: Empowering Vision-and-Language Navigation with Multi-Modal Prompts}},
  author    = {Hong, Haodong and Wang, Sen and Huang, Zi and Wu, Qi and Liu, Jiajun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {839-847},
  doi       = {10.24963/ijcai.2024/93},
  url       = {https://mlanthology.org/ijcai/2024/hong2024ijcai-only/}
}