Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Abstract

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large-language-model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby improving instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text-generation-based VL tasks through ICCC instruction tuning.
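The abstract describes constructing ICCC samples by introducing concept mismatches into captions, which the model must then correct conditioned on the image. A minimal sketch of that corruption step is below; it is an illustration only, not the authors' code. The dependency labels, the `corrupt_caption` helper, and the concept bank are all hypothetical choices here, and the input is assumed to be already parsed (the paper uses a lightweight dependency parser).

```python
# Illustrative sketch (not the authors' implementation): build an
# ICCC-style training pair by swapping one parsed noun in a caption
# for a mismatched concept. The model would be trained to map
# image + corrupted caption back to the original caption.
import random

def corrupt_caption(parsed_caption, concept_bank, seed=0):
    """Replace one nominal token in the caption with a mismatched concept.

    parsed_caption: list of (token, dependency_label) tuples,
                    assumed to come from some dependency parser.
    concept_bank:   candidate nouns, e.g. harvested from the dataset.
    Returns (corrupted_text, original_text).
    """
    rng = random.Random(seed)
    tokens = [tok for tok, _ in parsed_caption]
    original = " ".join(tokens)
    # Candidate positions: tokens parsed as nominal subjects/objects
    # (label set is an assumption for this sketch).
    noun_idxs = [i for i, (_, dep) in enumerate(parsed_caption)
                 if dep in {"nsubj", "dobj", "pobj"}]
    if not noun_idxs:
        return original, original  # nothing to corrupt
    i = rng.choice(noun_idxs)
    # Swap in a concept that differs from the original word.
    distractors = [c for c in concept_bank if c != tokens[i]]
    tokens[i] = rng.choice(distractors)
    return " ".join(tokens), original

# Toy usage with a hand-parsed caption and a tiny concept bank.
parsed = [("a", "det"), ("dog", "nsubj"), ("chases", "ROOT"),
          ("a", "det"), ("ball", "dobj")]
bank = ["cat", "frisbee", "car"]
corrupted, target = corrupt_caption(parsed, bank)
```

Because the swapped-in concept is always different from the word it replaces, each pair differs from its target in exactly one token, giving cheap self-supervised correction data without any task-specific labels.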

Cite

Text

Li et al. "Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01275

Markdown

[Li et al. "Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/li2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01275

BibTeX

@inproceedings{li2024cvpr-learning,
  title     = {{Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning}},
  author    = {Li, Rongjie and Wu, Yu and He, Xuming},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13428--13437},
  doi       = {10.1109/CVPR52733.2024.01275},
  url       = {https://mlanthology.org/cvpr/2024/li2024cvpr-learning/}
}