X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Abstract

Recent advances in large language models have enabled task prompting for open-ended text generation. In the vision domain, a longstanding goal is to develop models capable of general visual learning, encompassing tasks such as image generation, editing, low-level processing, and dense perception. Although recent efforts have aimed at building vision foundation models that support prompting, significant challenges remain, particularly in accurately comprehending visual prompts and addressing the ambiguity inherent in textual prompts. To address these challenges, we introduce X-Prompt, a purely auto-regressive large vision-language model designed for generalizable visual learning via in-context prompting. X-Prompt can process visual and textual prompts as context, enabling precise task interpretation and accurate execution. A novel prompt-token fusion mechanism effectively extracts relevant task information from complex prompts while significantly reducing the token length. Additionally, a unified training strategy for text and image prediction enhances task awareness, enabling seamless adaptation to open-ended prompts. Extensive experiments demonstrate that X-Prompt effectively interprets in-context prompts and exhibits generalization across both in-domain and out-of-domain visual tasks, paving the way for future advancements in general visual learning.

Cite

Text

Sun et al. "X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting." International Conference on Computer Vision, 2025.

Markdown

[Sun et al. "X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sun2025iccv-xprompt/)

BibTeX

@inproceedings{sun2025iccv-xprompt,
  title     = {{X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting}},
  author    = {Sun, Zeyi and Chu, Ziyang and Zhang, Pan and Wu, Tong and Zang, Yuhang and Dong, Xiaoyi and Xiong, Yuanjun and Lin, Dahua and Wang, Jiaqi},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17268--17280},
  url       = {https://mlanthology.org/iccv/2025/sun2025iccv-xprompt/}
}