RAVEN: End-to-End Equivariant Robot Learning with RGB Cameras
Abstract
Recent work has shown that equivariant policy networks can achieve strong performance on robot manipulation tasks with limited human demonstrations. However, existing equivariant methods typically require structured inputs, such as 3D point clouds or top-down camera views, which prevents their use in low-cost setups or dynamic environments. In this work, we propose the first $\mathrm{SE}(3)$-equivariant policy learning framework that operates with only RGB image observations. The key insight is to treat image-based data as collections of rays that, unlike 2D pixels, transform under 3D roto-translations. Extensive experiments in both simulation with diverse robot configurations and real-world settings demonstrate that our method consistently surpasses strong baselines in both performance and efficiency.
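To make the ray-based view concrete, here is a minimal NumPy sketch. It is not the paper's implementation, and the abstract does not say how rays are parameterized. The sketch assumes a pinhole camera with known intrinsics `K` and a known camera-to-world pose, and the names `pixel_rays`, `apply_se3`, and `se3_about_z` are hypothetical. It illustrates that when the camera undergoes a 3D roto-translation `g`, pixel coordinates stay fixed while the associated rays (origin plus unit direction) transform by `g`.

```python
import numpy as np

def pixel_rays(K, cam_to_world, pixels):
    """Back-project pixels into world-frame rays (assumes a pinhole camera).

    K: (3, 3) intrinsics; cam_to_world: (4, 4) pose; pixels: (N, 2) array of (u, v).
    Returns ray origins (N, 3) and unit directions (N, 3).
    """
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    homog = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    dirs_cam = homog @ np.linalg.inv(K).T                 # directions in the camera frame
    dirs_cam /= np.linalg.norm(dirs_cam, axis=1, keepdims=True)
    dirs_world = dirs_cam @ R.T                           # rotate into the world frame
    origins = np.broadcast_to(t, dirs_world.shape).copy() # all rays start at the camera center
    return origins, dirs_world

def apply_se3(g, origins, dirs):
    """Act on rays with g in SE(3): origins rotate and translate, directions only rotate."""
    R, t = g[:3, :3], g[:3, 3]
    return origins @ R.T + t, dirs @ R.T

def se3_about_z(angle, translation):
    """Hypothetical helper: rotation about the z-axis followed by a translation."""
    c, s = np.cos(angle), np.sin(angle)
    g = np.eye(4)
    g[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    g[:3, 3] = translation
    return g

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pose = se3_about_z(0.3, [0.1, -0.2, 0.5])
g = se3_about_z(1.1, [1.0, 2.0, -0.5])
pixels = np.array([[0.0, 0.0], [320.0, 240.0], [639.0, 479.0]])

# Rays computed from the moved camera match the original rays transformed by g.
moved_origins, moved_dirs = pixel_rays(K, g @ pose, pixels)
expected_origins, expected_dirs = apply_se3(g, *pixel_rays(K, pose, pixels))
rays_are_equivariant = bool(
    np.allclose(moved_origins, expected_origins) and np.allclose(moved_dirs, expected_dirs)
)
```

The same pixel grid is used before and after the motion; only the ray geometry changes. This is the sense in which rays, unlike 2D pixels, carry a well-defined action of 3D roto-translations.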
Cite
Text
Klee et al. "RAVEN: End-to-End Equivariant Robot Learning with RGB Cameras." International Conference on Learning Representations, 2026.
Markdown
[Klee et al. "RAVEN: End-to-End Equivariant Robot Learning with RGB Cameras." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/klee2026iclr-raven/)
BibTeX
@inproceedings{klee2026iclr-raven,
title = {{RAVEN: End-to-End Equivariant Robot Learning with RGB Cameras}},
author = {Klee, David and Hu, Boce and Cole, Andrew and Tian, Heng and Wang, Dian and Platt, Robert and Walters, Robin},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/klee2026iclr-raven/}
}