FlexCap: Describe Anything in Images in Controllable Detail
Abstract
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap’s effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap’s localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap’s utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
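The abstract describes conditioning caption generation on both an input box and a target length. The sketch below is purely illustrative of that interface idea, not the authors' implementation: the prefix format, function names, and the toy captioner are all assumptions.

```python
# Hypothetical sketch of FlexCap-style length-conditioned captioning.
# The prefix format and toy decoder are illustrative assumptions only.

def make_prefix(box, desired_length):
    """Build a conditioning prefix from box coordinates and a target
    word count; the length token controls caption detail."""
    x1, y1, x2, y2 = box
    return f"<box {x1} {y1} {x2} {y2}> <len {desired_length}>"

def toy_caption(prefix, vocabulary=("a", "red", "vintage", "car", "parked")):
    """Toy stand-in for the captioner: in the actual model, a
    vision-language transformer would decode a caption of roughly
    the requested length for the specified region."""
    n = int(prefix.split("<len ")[1].rstrip(">"))
    return " ".join(vocabulary[:n])

prefix = make_prefix((10, 20, 200, 180), desired_length=3)
print(prefix)
print(toy_caption(prefix))
```

Varying `desired_length` moves the output along the spectrum from a terse object label (length 1) to a richer description, which is the controllability the paper targets.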
Cite
Text
Dwibedi et al. "FlexCap: Describe Anything in Images in Controllable Detail." Neural Information Processing Systems, 2024. doi:10.52202/079017-3530
Markdown
[Dwibedi et al. "FlexCap: Describe Anything in Images in Controllable Detail." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/dwibedi2024neurips-flexcap/) doi:10.52202/079017-3530
BibTeX
@inproceedings{dwibedi2024neurips-flexcap,
title = {{FlexCap: Describe Anything in Images in Controllable Detail}},
author = {Dwibedi, Debidatta and Jain, Vidhi and Tompson, Jonathan and Zisserman, Andrew and Aytar, Yusuf},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3530},
url = {https://mlanthology.org/neurips/2024/dwibedi2024neurips-flexcap/}
}