UFO: A Unified Approach to Fine-Grained Visual Perception via Open-Ended Language Interface
Abstract
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that unifies fine-grained visual perception tasks through an open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby achieving superior performance on the challenging reasoning segmentation task. Code and models are available at https://github.com/nnnth/UFO.
Cite
Text
Tang et al. "UFO: A Unified Approach to Fine-Grained Visual Perception via Open-Ended Language Interface." Advances in Neural Information Processing Systems, 2025.
Markdown
[Tang et al. "UFO: A Unified Approach to Fine-Grained Visual Perception via Open-Ended Language Interface." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/tang2025neurips-ufo/)
BibTeX
@inproceedings{tang2025neurips-ufo,
title = {{UFO: A Unified Approach to Fine-Grained Visual Perception via Open-Ended Language Interface}},
author = {Tang, Hao and Xie, Chen-Wei and Wang, Haiyang and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/tang2025neurips-ufo/}
}