Let's Roll a BiFTA: Bi-Refinement for Fine-Grained Text-Visual Alignment in Vision-Language Models
Abstract
Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives, view refinement and description refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in text-visual alignment.
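Both refinement steps reduce, in essence, to greedy threshold-based filtering. Below is a minimal Python sketch of that idea; the function names, thresholds (iou_thresh, sim_thresh), and the greedy keep-or-drop order are illustrative assumptions, not the paper's exact procedure.

# Hypothetical sketch of BiFTA-style bi-refinement: greedy filtering of
# image patches by IoU and of text descriptions by cosine similarity.
# Thresholds and selection order are assumptions for illustration.
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def refine_views(boxes, iou_thresh=0.5):
    """View refinement: drop any patch whose IoU with an
    already-kept patch exceeds iou_thresh."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept

def refine_descriptions(text_embs, sim_thresh=0.9):
    """Description refinement: drop any description whose cosine
    similarity with an already-kept description exceeds sim_thresh."""
    embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    kept = []
    for i, e in enumerate(embs):
        if all(float(e @ embs[j]) <= sim_thresh for j in kept):
            kept.append(i)
    return kept  # indices of retained descriptions

The surviving patches and descriptions would then be embedded and matched as in standard fine-grained text-visual alignment with CLIP.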
Cite
Text
Sun et al. "Let's Roll a BiFTA: Bi-Refinement for Fine-Grained Text-Visual Alignment in Vision-Language Models." Transactions on Machine Learning Research, 2026.
BibTeX
@article{sun2026tmlr-let,
title = {{Let's Roll a BiFTA: Bi-Refinement for Fine-Grained Text-Visual Alignment in Vision-Language Models}},
author = {Sun, Yuhao and Cai, Chengyi and Zhang, Jiacheng and Ye, Zesheng and Yuan, Xingliang and Liu, Feng},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/sun2026tmlr-let/}
}