Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

Abstract

Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our generated assets by using them to train robotic policies for fine-grained manipulation tasks that go beyond basic pick-and-place.
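
The abstract describes an actor-critic loop in which a VLM actor proposes articulation code, a critic evaluates it, and the feedback drives refinement. The following is a minimal illustrative sketch of that propose-evaluate-refine pattern only; the function names, feedback format, and stubbed VLM calls are assumptions for illustration, not the authors' actual implementation or API.

```python
# Illustrative sketch of a propose-evaluate-refine articulation loop.
# The VLM calls are stubbed out; names and the feedback format are
# hypothetical, not the authors' actual interface.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Critique:
    passed: bool
    feedback: str


def actor_propose(observation: str, feedback: Optional[str]) -> str:
    """Stub for the VLM 'actor': returns candidate articulation code."""
    return f"# candidate articulation for: {observation} (prior feedback: {feedback})"


def critic_evaluate(candidate: str) -> Critique:
    """Stub for the VLM 'critic': inspects/renders the candidate and rates it."""
    return Critique(passed=True, feedback="joints look plausible")


def articulate(observation: str, max_iters: int = 5) -> Optional[str]:
    """Iteratively propose, evaluate, and refine until the critic accepts."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        candidate = actor_propose(observation, feedback)
        critique = critic_evaluate(candidate)
        if critique.passed:
            return candidate          # would then be compiled into a simulator-ready asset
        feedback = critique.feedback  # self-correct using the critic's feedback
    return None                       # no accepted candidate within the budget


if __name__ == "__main__":
    print(articulate("video of a washing machine door opening"))
```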

Cite

Text

Le et al. "Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model." International Conference on Learning Representations, 2025.

Markdown

[Le et al. "Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/le2025iclr-articulateanything/)

BibTeX

@inproceedings{le2025iclr-articulateanything,
  title     = {{Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model}},
  author    = {Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/le2025iclr-articulateanything/}
}