Interleaved-Modal Chain-of-Thought
Abstract
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with ignorable additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations of three benchmarks have shown that ICoT prompting achieves substantial performance (up to 14%) and interpretability improvements compared to existing multimodal CoT prompting methods.
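The abstract describes ADS only at a high level: regions of the input image are chosen from the VLM's attention map and inserted into the reasoning steps, with no extra parameters. As a rough illustration of that idea, the sketch below assumes a simple top-k rule over per-patch attention weights. The top-k rule, the function names `select_patches` and `interleave`, and the toy attention values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def select_patches(attention: Sequence[float], k: int) -> List[int]:
    """Return indices of the k patches with the highest attention weight.

    Assumption: top-k selection stands in for the selection rule,
    which the abstract does not specify.
    """
    if not 0 < k <= len(attention):
        raise ValueError("k must be between 1 and the number of patches")
    # Rank patch indices by attention weight, highest first.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    # Restore the original (raster) order so selected patches keep their layout.
    return sorted(ranked[:k])


def interleave(text_tokens: List[str], patch_tokens: Sequence[str],
               attention: Sequence[float], k: int) -> List[str]:
    """Append the selected patch tokens after the current textual rationale."""
    selected = select_patches(attention, k)
    return text_tokens + [patch_tokens[i] for i in selected]


# Toy example: a 2x3 grid of image patches, flattened row by row.
patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
attn = [0.05, 0.30, 0.10, 0.02, 0.40, 0.13]  # hypothetical attention weights
step = interleave(["The", "cup", "is", "red"], patches, attn, k=2)
print(step)  # ['The', 'cup', 'is', 'red', 'p1', 'p4']
```

Because the selection reads weights the model has already computed, a rule of this kind adds no trainable parameters, which is consistent with the plug-and-play claim in the abstract.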
Cite
Text
Gao et al. "Interleaved-Modal Chain-of-Thought." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01818
Markdown
[Gao et al. "Interleaved-Modal Chain-of-Thought." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/gao2025cvpr-interleavedmodal/) doi:10.1109/CVPR52734.2025.01818
BibTeX
@inproceedings{gao2025cvpr-interleavedmodal,
title = {{Interleaved-Modal Chain-of-Thought}},
author = {Gao, Jun and Li, Yongqi and Cao, Ziqiang and Li, Wenjie},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {19520-19529},
doi = {10.1109/CVPR52734.2025.01818},
url = {https://mlanthology.org/cvpr/2025/gao2025cvpr-interleavedmodal/}
}