Unveiling and Mitigating Shortcuts in Multimodal In-Context Learning
Abstract
The performance of Large Vision-Language Models (LVLMs) during in-context learning (ICL) is heavily influenced by shortcut learning, especially in tasks that demand robust multimodal reasoning and open-ended generation. To mitigate this, we introduce task mapping as a novel framework for analyzing shortcut learning and demonstrate that conventional in-context demonstration (ICD) selection methods can disrupt the coherence of task mappings. Building on these insights, we propose Ta-ICL, a task-aware model that enhances task mapping cohesion through task-aware attention and autoregressive retrieval. Extensive experiments on diverse vision-language tasks show that Ta-ICL significantly reduces shortcut learning, improves reasoning consistency, and boosts LVLM adaptability. These findings underscore the potential of task mapping as a key strategy for refining multimodal reasoning, paving the way for more robust and generalizable ICL frameworks.
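To make the retrieval idea in the abstract concrete, below is a minimal sketch of autoregressive ICD selection: demonstrations are picked one at a time, and each new pick is scored against a running context that already includes the previously chosen demonstrations, so later choices depend on earlier ones. The function names, the cosine-similarity scoring rule, and the running-mean context update are illustrative assumptions for this sketch, not the paper's actual Ta-ICL implementation.

```python
# Minimal sketch of autoregressive in-context demonstration (ICD) retrieval.
# All design choices here (cosine scoring, running-mean context update, feature
# dimensions) are assumptions made for illustration only.

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def autoregressive_icd_retrieval(
    query_feat: np.ndarray,      # multimodal feature of the query sample
    pool_feats: np.ndarray,      # (N, d) features of candidate demonstrations
    num_icds: int = 4,
) -> list[int]:
    """Greedily pick ICDs one at a time, re-scoring the candidate pool after
    each pick so that later choices are conditioned on earlier selections."""
    selected: list[int] = []
    context = query_feat.copy()  # running representation of query + chosen ICDs
    remaining = set(range(len(pool_feats)))

    for _ in range(num_icds):
        # Score every remaining candidate against the current context.
        best_idx = max(remaining, key=lambda i: cosine(context, pool_feats[i]))
        selected.append(best_idx)
        remaining.remove(best_idx)
        # Autoregressive step: fold the chosen demonstration into the context
        # as a running mean, so the next score depends on all picks so far.
        context = (context * len(selected) + pool_feats[best_idx]) / (len(selected) + 1)

    return selected


# Toy usage with random features standing in for LVLM image-text embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=128)
pool = rng.normal(size=(50, 128))
print(autoregressive_icd_retrieval(query, pool, num_icds=4))
```

In this sketch the conditioning on prior picks is what distinguishes autoregressive retrieval from independent top-k similarity search, which scores each candidate against the query alone and can assemble a demonstration set whose task mapping is incoherent.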
Cite
Text
Li. "Unveiling and Mitigating Shortcuts in Multimodal In-Context Learning." ICLR 2025 Workshops: SCSL, 2025.Markdown
[Li. "Unveiling and Mitigating Shortcuts in Multimodal In-Context Learning." ICLR 2025 Workshops: SCSL, 2025.](https://mlanthology.org/iclrw/2025/li2025iclrw-unveiling/)BibTeX
@inproceedings{li2025iclrw-unveiling,
title = {{Unveiling and Mitigating Shortcuts in Multimodal In-Context Learning}},
author = {Li, Yanshu},
booktitle = {ICLR 2025 Workshops: SCSL},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/li2025iclrw-unveiling/}
}