Learning Modality-Agnostic Representation for Semantic Segmentation from Any Modalities
Abstract
Image modality is not perfect as it often fails in certain conditions, e.g., night and fast motion. This significantly limits the robustness and versatility of existing multi-modal (e.g., Image+X) semantic segmentation methods when confronting modality absence or failure, as often occurs in real-world applications. Inspired by the open-world learning capability of multi-modal vision-language models (MVLMs), we explore a new direction in learning the modality-agnostic representation via knowledge distillation (KD) from MVLMs. Intuitively, we propose Any2Seg, a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions. Specifically, we first introduce a novel language-guided semantic correlation distillation (LSCD) module to transfer both inter-modal and intra-modal semantic knowledge in the embedding space from MVLMs, e.g., LanguageBind. This enables us to minimize the modality gap and alleviate semantic ambiguity so as to combine any modalities in any visual conditions. Then, we introduce a modality-agnostic feature fusion (MFF) module that reweights the multi-modal features based on the inter-modal correlation and selects the fine-grained features. This way, our Any2Seg finally yields an optimal modality-agnostic representation. Extensive experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art performance under the multi-modal setting (+3.54 mIoU) and excels in the challenging modality-incomplete setting (+19.79 mIoU).
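The sketch below is not the authors' implementation; it is a minimal, self-contained illustration of the two ideas the abstract names: distilling cross-modal correlation structure from a teacher's (e.g., an MVLM's) embeddings, and reweighting per-modality features by their inter-modal correlation before fusion. All module names, shapes, and the pooled-feature setup are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch only; the actual Any2Seg modules (LSCD, MFF) differ in detail.

def correlation_distillation_loss(student_feats, teacher_feats):
    """Match the student's inter-modal correlation matrix to the teacher's.

    student_feats, teacher_feats: lists of per-modality tensors, each (B, D),
    e.g., globally pooled and projected features for RGB/Depth/Event/LiDAR.
    """
    def correlation(feats):
        z = F.normalize(torch.stack(feats, dim=1), dim=-1)   # (B, M, D)
        return torch.einsum('bmd,bnd->bmn', z, z)             # (B, M, M) cosine correlations
    return F.mse_loss(correlation(student_feats), correlation(teacher_feats))


class ModalityAgnosticFusion(nn.Module):
    """Reweight each modality by its agreement with the other modalities, then fuse."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (B, M, D) -- one feature vector per available modality.
        z = F.normalize(self.proj(feats), dim=-1)
        corr = torch.einsum('bmd,bnd->bmn', z, z)              # inter-modal correlation
        # Modalities that correlate well with the others receive higher weights.
        weights = torch.softmax(corr.mean(dim=-1), dim=-1)     # (B, M)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)     # (B, D) modality-agnostic feature
        return fused, weights


if __name__ == "__main__":
    B, M, D = 2, 4, 256   # batch, number of modalities, feature dimension (assumed)
    student = [torch.randn(B, D) for _ in range(M)]
    teacher = [torch.randn(B, D) for _ in range(M)]
    print("distillation loss:", correlation_distillation_loss(student, teacher).item())
    fused, w = ModalityAgnosticFusion(D)(torch.stack(student, dim=1))
    print("fused:", fused.shape, "weights:", w.shape)
```

In this toy setup, a modality that drops out or degrades (e.g., the camera at night) would correlate poorly with the remaining modalities and thus receive a lower fusion weight, which is the intuition behind a modality-agnostic representation.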
Cite
Text
Zheng et al. "Learning Modality-Agnostic Representation for Semantic Segmentation from Any Modalities." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72754-2_9
Markdown
[Zheng et al. "Learning Modality-Agnostic Representation for Semantic Segmentation from Any Modalities." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zheng2024eccv-learning-a/) doi:10.1007/978-3-031-72754-2_9
BibTeX
@inproceedings{zheng2024eccv-learning-a,
title = {{Learning Modality-Agnostic Representation for Semantic Segmentation from Any Modalities}},
author = {Zheng, Xu and Lyu, Yuanhuiyi and Wang, Lin},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72754-2_9},
url = {https://mlanthology.org/eccv/2024/zheng2024eccv-learning-a/}
}