Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for various computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
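
To make the idea of producing "results in text forms" more concrete, here is a minimal Python sketch of how object-detection output could be serialized into a single string that a sequence-to-sequence model might emit. The `<loc_N>` token format, the choice of 1000 coordinate bins, and every function name below are assumptions made for illustration; the abstract does not specify them, and this is not presented as the authors' implementation.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]

# Assumed number of coordinate bins; an illustrative choice, not from the abstract.
NUM_BINS = 1000


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labeled box as the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)


def detections_to_text(
    detections: List[Tuple[str, Box]], width: int, height: int
) -> str:
    """Concatenate all labeled boxes into one hypothetical target string."""
    return "".join(box_to_text(label, box, width, height) for label, box in detections)


if __name__ == "__main__":
    dets = [("cat", (0.0, 0.0, 320.0, 240.0)), ("dog", (320.0, 240.0, 640.0, 480.0))]
    print(detections_to_text(dets, width=640, height=480))
```

Under these assumptions, captioning, detection, and grounding would all share one output space (a token sequence), which is the property the abstract attributes to a unified prompt-based representation.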

Cite

Text

Xiao et al. "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00461

Markdown

[Xiao et al. "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/xiao2024cvpr-florence2/) doi:10.1109/CVPR52733.2024.00461

BibTeX

@inproceedings{xiao2024cvpr-florence2,
  title     = {{Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks}},
  author    = {Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {4818--4829},
  doi       = {10.1109/CVPR52733.2024.00461},
  url       = {https://mlanthology.org/cvpr/2024/xiao2024cvpr-florence2/}
}