Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Abstract
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models that adapt to high-frequency local feedback. Through extensive experiments in three domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves ($2\times$ on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
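The abstract's core idea, a slow pathway over low-frequency global features combined with a fast pathway over high-frequency local feedback, can be sketched as a simple multi-rate control loop. This is a minimal illustrative sketch, not the authors' implementation: the function names, the dummy observations, and the update rates are all assumptions made for illustration.

```python
# Hedged sketch (not the paper's code): a multi-frequency control loop where a
# slow "global" pathway (standing in for an expensive pretrained vision-language
# model) refreshes a context feature at a low rate, while a small fast "local"
# pathway reacts to high-frequency feedback on every control tick.

def slow_global_features(image, instruction):
    # Placeholder for a costly VLM forward pass over a global camera view
    # conditioned on a language instruction.
    return [float(len(instruction)), sum(image) / len(image)]

def fast_local_policy(global_feat, local_obs):
    # Placeholder for a small non-pretrained network: cheap enough to run at
    # the full control rate on local feedback (e.g. wrist camera, force sensing).
    return [g * 0.1 + o for g, o in zip(global_feat, local_obs)]

def control_loop(steps=20, slow_every=10):
    image = [0.2, 0.4, 0.6]           # dummy global camera observation
    instruction = "pick up the mug"   # language conditioning
    global_feat = slow_global_features(image, instruction)
    actions = []
    for t in range(steps):
        if t % slow_every == 0:       # low-frequency global update
            global_feat = slow_global_features(image, instruction)
        local_obs = [0.01 * t, -0.01 * t]   # dummy high-frequency feedback
        actions.append(fast_local_policy(global_feat, local_obs))
    return actions

actions = control_loop()
```

Here the expensive global pathway runs once every `slow_every` ticks while the cheap local pathway produces an action every tick, which is the essential structure that lets a large pretrained model contribute without capping the control frequency.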
Cite

Text

Saxena et al. "Multi-Resolution Sensing for Real-Time Control with Vision-Language Models." Conference on Robot Learning, 2023.

Markdown

[Saxena et al. "Multi-Resolution Sensing for Real-Time Control with Vision-Language Models." Conference on Robot Learning, 2023.](https://mlanthology.org/corl/2023/saxena2023corl-multiresolution/)

BibTeX
@inproceedings{saxena2023corl-multiresolution,
title = {{Multi-Resolution Sensing for Real-Time Control with Vision-Language Models}},
author = {Saxena, Saumya and Sharma, Mohit and Kroemer, Oliver},
booktitle = {Conference on Robot Learning},
year = {2023},
pages = {2210--2228},
volume = {229},
url = {https://mlanthology.org/corl/2023/saxena2023corl-multiresolution/}
}