Cross-Modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation

Abstract

Since single-modal controllable manipulation typically requires supervision from other modalities or cooperation with complex software and domain experts, this paper addresses the problem of cross-modal adaptive manipulation (CAM). This novel task performs cross-modal semantic alignment through mutual supervision and carries out bidirectional exchange of attributes, relations, or objects in parallel, benefiting both modalities while significantly reducing manual effort. We introduce a robust solution for CAM that comprises two essential modules: Heterogeneous Representation Learning (HRL) and Cross-modal Relation Reasoning (CRR). The former performs representation learning for cross-modal semantic alignment on heterogeneous graph nodes, while the latter identifies and exchanges the focused attributes, relations, or objects across both modalities. Our method produces pleasing cross-modal outputs on the CUB and Visual Genome datasets.
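
As a rough illustration of the cross-modal semantic alignment that HRL performs on heterogeneous graph nodes, the PyTorch sketch below projects text-graph and image-graph node features into a shared space and pulls matched pairs together with a symmetric contrastive objective. The module names, feature dimensions, and the InfoNCE-style loss are assumptions made for exposition only; they are not the paper's actual implementation.

# Minimal sketch of cross-modal node alignment; all names and dimensions are
# hypothetical and chosen for illustration, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NodeProjector(nn.Module):
    """Projects modality-specific node features into a shared embedding space."""

    def __init__(self, in_dim: int, shared_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products act as cosine similarities.
        return F.normalize(self.proj(x), dim=-1)


def alignment_loss(text_nodes: torch.Tensor, image_nodes: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over matched cross-modal node pairs."""
    logits = text_nodes @ image_nodes.t() / temperature
    targets = torch.arange(text_nodes.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy batch: 8 paired graph nodes with hypothetical 256-d text and 512-d visual features.
    text_feats = torch.randn(8, 256)
    image_feats = torch.randn(8, 512)

    text_proj = NodeProjector(256, 128)
    image_proj = NodeProjector(512, 128)

    loss = alignment_loss(text_proj(text_feats), image_proj(image_feats))
    print(f"alignment loss: {loss.item():.4f}")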

Cite

Text

Li et al. "Cross-Modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation." International Joint Conference on Artificial Intelligence, 2022. doi:10.24963/IJCAI.2022/447

Markdown

[Li et al. "Cross-Modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation." International Joint Conference on Artificial Intelligence, 2022.](https://mlanthology.org/ijcai/2022/li2022ijcai-cross/) doi:10.24963/IJCAI.2022/447

BibTeX

@inproceedings{li2022ijcai-cross,
  title     = {{Cross-Modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation}},
  author    = {Li, Lei and Fan, Kai and Yuan, Chun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {3222--3228},
  doi       = {10.24963/IJCAI.2022/447},
  url       = {https://mlanthology.org/ijcai/2022/li2022ijcai-cross/}
}