OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, Hao Dong

CVPR 2025 pp. 17359-17369

doi:10.1109/CVPR52734.2025.01618

Abstract

The development of general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel at high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLAs) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between a VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale data generation.
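
To make the idea of interaction primitives as spatial constraints concrete, the sketch below shows one way a point-and-direction primitive defined in an object's canonical frame could be mapped into the world frame with a tracked 6D pose and then scored against a target distance and angle. It is an illustration only, not the authors' implementation: the names `Primitive`, `Pose`, `to_world`, and `constraint_error`, and the specific distance/angle error terms, are assumptions introduced here.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]
Mat3 = tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class Primitive:
    """Hypothetical interaction primitive: a point and a unit direction."""
    point: Vec3
    direction: Vec3


@dataclass(frozen=True)
class Pose:
    """6D pose mapping an object's canonical frame to the world frame."""
    rotation: Mat3      # row-major 3x3 rotation matrix
    translation: Vec3


def rotate(R: Mat3, v: Vec3) -> Vec3:
    """Apply the rotation matrix R to the vector v."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def to_world(pose: Pose, prim: Primitive) -> Primitive:
    """Express a canonical-frame primitive in the world frame."""
    rotated = rotate(pose.rotation, prim.point)
    point = tuple(rotated[i] + pose.translation[i] for i in range(3))
    # Directions are rotated but not translated.
    direction = rotate(pose.rotation, prim.direction)
    return Primitive(point, direction)


def constraint_error(active: Primitive, passive: Primitive,
                     target_distance: float, target_angle: float):
    """Return (distance error, angle error) between two world-frame primitives.

    Assumed constraint form: the two points should be `target_distance`
    apart and the two directions should differ by `target_angle` radians.
    """
    dist = math.dist(active.point, passive.point)
    dot = sum(a * b for a, b in zip(active.direction, passive.direction))
    angle = math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding
    return abs(dist - target_distance), abs(angle - target_angle)


if __name__ == "__main__":
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    # Illustrative values: an object held 10 cm above another, pointing down.
    held = to_world(Pose(identity, (0.0, 0.0, 0.1)),
                    Primitive((0.0, 0.0, 0.0), (0.0, 0.0, -1.0)))
    target = to_world(Pose(identity, (0.0, 0.0, 0.0)),
                      Primitive((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
    print(constraint_error(held, target, 0.1, math.pi))
```

In a closed-loop setting of the kind the abstract describes, the poses would be refreshed from a 6D pose tracker at each step and the errors recomputed; how the system actually formulates and optimizes its constraints is specified in the paper, not here.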

Cite

Text

Pan et al. "OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01618

Markdown

[Pan et al. "OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/pan2025cvpr-omnimanip/) doi:10.1109/CVPR52734.2025.01618

BibTeX

@inproceedings{pan2025cvpr-omnimanip,
  title     = {{OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints}},
  author    = {Pan, Mingjie and Zhang, Jiyao and Wu, Tianshu and Zhao, Yinghao and Gao, Wenlong and Dong, Hao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {17359--17369},
  doi       = {10.1109/CVPR52734.2025.01618},
  url       = {https://mlanthology.org/cvpr/2025/pan2025cvpr-omnimanip/}
}