Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network
Abstract
We introduce Perception Encoder (PE), a family of state-of-the-art vision encoders for image and video understanding. Traditionally, vision encoders have relied on a variety of pretraining objectives, each excelling at different downstream tasks. Surprisingly, after scaling a carefully tuned image pretraining recipe and refining with a robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves state-of-the-art results on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, tracking, and depth estimation. We release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
Cite
Text
Bolya et al. "Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network." Advances in Neural Information Processing Systems, 2025.
Markdown
[Bolya et al. "Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/bolya2025neurips-perception/)
BibTeX
@inproceedings{bolya2025neurips-perception,
title = {{Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network}},
author = {Bolya, Daniel and Huang, Po-Yao and Sun, Peize and Cho, Jang Hyun and Madotto, Andrea and Wei, Chen and Ma, Tengyu and Zhi, Jiale and Rajasegaran, Jathushan and Rasheed, Hanoona Abdul and Wang, Junke and Monteiro, Marco and Xu, Hu and Dong, Shiyu and Ravi, Nikhila and Li, Shang-Wen and Dollar, Piotr and Feichtenhofer, Christoph},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/bolya2025neurips-perception/}
}