DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning

Haoran Xu, Peixi Peng, Guang Tan, Yuan Li, Xinhai Xu, Yonghong Tian

CVPR 2024 pp. 26508-26518

doi:10.1109/CVPR52733.2024.02503

Abstract

We explore visual reinforcement learning (RL) using two complementary visual modalities: a frame-based RGB camera and an event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods often struggle to extract task-relevant information from multiple modalities while suppressing the increased noise, because they rely only on indirect reward signals instead of pixel-level supervision. To tackle this, we propose a Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features), RGB-specific noise, and DVS-specific noise. The co-features represent the full information from both modalities that is relevant to the RL task; the two noise components, each constrained by a data reconstruction loss to avoid information leak, are contrasted with the co-features to maximize their difference. Extensive experiments demonstrate that, by explicitly separating the different types of information, our approach achieves substantially improved policy performance compared to state-of-the-art approaches.
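The abstract names two kinds of training signal: a reconstruction loss on each noise component, and a contrast between the noise components and the co-features. The sketch below illustrates how such terms might be combined. It is not the paper's formulation: the abstract does not give the loss definitions, so the cosine-similarity penalty, the mean-squared reconstruction error, the `weight` parameter, and every function and argument name here are hypothetical stand-ins chosen for illustration. How the reconstructions are produced is also left open, so they are passed in as plain vectors.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def decomposition_loss(co, rgb_noise, dvs_noise,
                       rgb_in, rgb_recon, dvs_in, dvs_recon,
                       weight=1.0):
    """Illustrative loss over three decomposed components (assumed form).

    co         -- co-feature vector (task-relevant, shared by both modalities)
    rgb_noise  -- RGB-specific noise vector
    dvs_noise  -- DVS-specific noise vector
    *_in       -- flattened input for each modality
    *_recon    -- reconstruction of that input (decoder not modelled here)
    """
    # Reconstruction term: keeps each noise branch tied to its own input.
    recon = mse(rgb_recon, rgb_in) + mse(dvs_recon, dvs_in)
    # Separation term: penalise positive similarity between the co-features
    # and each noise component (a simple stand-in for a contrastive loss).
    separation = (max(0.0, cosine(co, rgb_noise))
                  + max(0.0, cosine(co, dvs_noise)))
    return recon + weight * separation


# Noise orthogonal to the co-features and perfect reconstructions: no penalty.
separated = decomposition_loss(
    co=[1.0, 0.0], rgb_noise=[0.0, 1.0], dvs_noise=[0.0, -1.0],
    rgb_in=[0.5, 0.5], rgb_recon=[0.5, 0.5],
    dvs_in=[0.2, 0.8], dvs_recon=[0.2, 0.8],
)

# RGB noise identical to the co-features: the separation term is penalised.
entangled = decomposition_loss(
    co=[1.0, 0.0], rgb_noise=[1.0, 0.0], dvs_noise=[0.0, -1.0],
    rgb_in=[0.5, 0.5], rgb_recon=[0.5, 0.5],
    dvs_in=[0.2, 0.8], dvs_recon=[0.2, 0.8],
)
```

In this toy setting the loss is lowest when the noise vectors share no direction with the co-features and each input is reconstructed exactly, which mirrors the intent described in the abstract; in an RL pipeline such a term would presumably be added to the policy objective, though the abstract does not say how.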

Cite

Text

Xu et al. "DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02503

Markdown

[Xu et al. "DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/xu2024cvpr-dmr/) doi:10.1109/CVPR52733.2024.02503

BibTeX

@inproceedings{xu2024cvpr-dmr,
  title     = {{DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning}},
  author    = {Xu, Haoran and Peng, Peixi and Tan, Guang and Li, Yuan and Xu, Xinhai and Tian, Yonghong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {26508-26518},
  doi       = {10.1109/CVPR52733.2024.02503},
  url       = {https://mlanthology.org/cvpr/2024/xu2024cvpr-dmr/}
}