Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

Li, Bohan; Sun, Yasheng; Liang, Zhujin; Du, Dalong; Zhang, Zhuanghui; Wang, Xiaofeng; Wang, Yunnan; Jin, Xin; Zeng, Wenjun

doi:10.24963/ijcai.2024/107

IJCAI 2024 pp. 965-973

Abstract

In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE

Cite

Text

Li et al. "Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/107

Markdown

[Li et al. "Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/li2024ijcai-bridging/) doi:10.24963/ijcai.2024/107

BibTeX

@inproceedings{li2024ijcai-bridging,
  title     = {{Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion}},
  author    = {Li, Bohan and Sun, Yasheng and Liang, Zhujin and Du, Dalong and Zhang, Zhuanghui and Wang, Xiaofeng and Wang, Yunnan and Jin, Xin and Zeng, Wenjun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {965--973},
  doi       = {10.24963/ijcai.2024/107},
  url       = {https://mlanthology.org/ijcai/2024/li2024ijcai-bridging/}
}