OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Abstract
Vision-Language-Action (VLA) models aim to predict robotic actions from visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are fed independently into downstream policies, which degrades the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts only the task-relevant visual features that are semantically aligned with the language instruction and passes them to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. OTTER thereby preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
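The idea of text-aware visual feature extraction can be illustrated with a small sketch. The abstract does not specify the extraction mechanism, so the code below is not the OTTER implementation. It assumes that frozen encoders yield patch features and text-token features in a shared embedding space, and it pools patch features with softmax weights derived from cosine similarity to each text token. The function name `text_aware_pool`, the temperature value, and the feature shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_aware_pool(patch_feats, text_feats, temperature=0.07):
    """Pool visual patch features using text-conditioned weights.

    patch_feats: (N, D) array from a frozen vision encoder (assumed).
    text_feats:  (T, D) array from a frozen text encoder (assumed).
    Returns one pooled visual feature per text token, shape (T, D),
    and the (T, N) weights over patches.
    """
    p = l2_normalize(patch_feats)
    t = l2_normalize(text_feats)
    # Cosine similarity between every text token and every patch.
    logits = (t @ p.T) / temperature
    # Numerically stable softmax over the patch dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Patches aligned with the instruction dominate the pooled feature.
    return weights @ patch_feats, weights


# Random stand-ins for encoder outputs; real features would come from
# a frozen vision-language model.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(196, 512))  # e.g. a 14 x 14 patch grid
text_feats = rng.normal(size=(8, 512))     # e.g. 8 instruction tokens
pooled, weights = text_aware_pool(patch_feats, text_feats)
```

In this sketch only the pooled `(T, D)` features would be handed to a downstream policy, so the encoders themselves need no gradient updates.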
Cite
Text
Huang et al. "OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Huang et al. "OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-otter/)
BibTeX
@inproceedings{huang2025icml-otter,
title = {{OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction}},
author = {Huang, Huang and Liu, Fangchen and Fu, Letian and Wu, Tingfan and Mukadam, Mustafa and Malik, Jitendra and Goldberg, Ken and Abbeel, Pieter},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {25566-25580},
volume = {267},
url = {https://mlanthology.org/icml/2025/huang2025icml-otter/}
}