Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Abstract
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
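As a rough illustration of the data-construction idea described in the abstract, the sketch below corrupts a caption by swapping one of its nouns for a mismatched concept, so that the (image, corrupted caption) pair becomes the input and the original caption the correction target. The use of spaCy as the lightweight dependency parser and the names `build_iccc_sample` and `concept_pool` are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch (not the paper's exact pipeline): build one ICCC-style sample
# by corrupting a caption with a mismatched concept. spaCy stands in for the
# lightweight dependency parser; `build_iccc_sample` and `concept_pool` are
# hypothetical names.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a parser/tagger

def build_iccc_sample(caption: str, concept_pool: list[str]) -> dict:
    """Return an (instruction, input, target) triple for caption correction."""
    doc = nlp(caption)
    nouns = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
    if not nouns:
        return {"input": caption, "target": caption}  # nothing to corrupt
    victim = random.choice(nouns)
    # Sample a distractor concept that differs from the original word.
    distractor = random.choice([c for c in concept_pool if c != victim.text])
    corrupted = "".join(
        (distractor + t.whitespace_) if t.i == victim.i else t.text_with_ws
        for t in doc
    )
    return {
        "instruction": "Correct the caption so it matches the image.",
        "input": corrupted,   # mismatched caption, paired with the image
        "target": caption,    # model learns to regenerate the true caption
    }

# Example usage:
# sample = build_iccc_sample("a dog chasing a ball in the park",
#                            concept_pool=["cat", "frisbee", "beach"])
```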
Cite
Text
Li et al. "Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01275
Markdown
[Li et al. "Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/li2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01275
BibTeX
@inproceedings{li2024cvpr-learning,
title = {{Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning}},
author = {Li, Rongjie and Wu, Yu and He, Xuming},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13428--13437},
doi = {10.1109/CVPR52733.2024.01275},
url = {https://mlanthology.org/cvpr/2024/li2024cvpr-learning/}
}