FeatSharp: Your Vision Model Features, Sharper

Abstract

The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is Vision Transformers (ViTs), typically trained using a contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution: most run at $224 \times 224$ px, and even the "high-resolution" versions, at roughly $378$-$448$ px, remain inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training, using RADIO as a way of providing richer targets for distillation. Code is available at https://github.com/NVlabs/FeatSharp
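To make the resolution problem concrete, the sketch below illustrates the baseline the abstract alludes to, not the FeatSharp method itself: a ViT run at $224 \times 224$ px with patch size 14 produces only a $16 \times 16$ token grid, and naively interpolating that grid to a higher resolution cannot recover fine-grained detail. The `encoder` interface and `upsample_vit_features` helper are hypothetical stand-ins for illustration.

import torch
import torch.nn.functional as F

def upsample_vit_features(encoder, image, patch_size=14, scale=4):
    """Illustrative baseline: run a ViT at its native (low) resolution and
    bilinearly upsample the resulting feature map.

    `encoder` is assumed to return patch tokens of shape (B, N, C) with the
    class token already removed (a hypothetical interface, not FeatSharp's).
    """
    b, _, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size           # token grid, e.g. 16x16 at 224 px
    tokens = encoder(image)                             # (B, N, C), with N = gh * gw
    feats = tokens.transpose(1, 2).reshape(b, -1, gh, gw)  # (B, C, gh, gw)
    # Naive upsampling is cheap but blurry; FeatSharp instead aims to produce a
    # coherent high-resolution map that retains detail lost at low input resolution.
    return F.interpolate(feats, scale_factor=scale, mode="bilinear", align_corners=False)

# Example usage with a stand-in encoder (e.g. a CLIP visual trunk wrapped to
# return patch tokens):
#   hires = upsample_vit_features(my_encoder, torch.randn(1, 3, 224, 224))
#   hires.shape  ->  (1, C, 64, 64)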

Cite

Text

Ranzinger et al. "FeatSharp: Your Vision Model Features, Sharper." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Ranzinger et al. "FeatSharp: Your Vision Model Features, Sharper." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ranzinger2025icml-featsharp/)

BibTeX

@inproceedings{ranzinger2025icml-featsharp,
  title     = {{FeatSharp: Your Vision Model Features, Sharper}},
  author    = {Ranzinger, Mike and Heinrich, Greg and Molchanov, Pavlo and Catanzaro, Bryan and Tao, Andrew},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {51156--51182},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/ranzinger2025icml-featsharp/}
}