Learning to Jointly Understand Visual and Tactile Signals

Abstract

Modeling and analyzing objects and shapes has been well studied in the past. However, manipulating complex tools and articulated objects remains difficult for autonomous agents. Human hands, in contrast, are dexterous and adaptive: we can easily transfer a manipulation skill from one object to every object in its class, and to other similar classes. Our intuition is that there is a close connection between manipulation and the topology and articulation of objects; the possible articulations of an object indicate the types of manipulation needed to operate it. In this work, we take a manipulation perspective to understand everyday objects and tools. We collect a multi-modal visual-tactile dataset that contains paired full-hand force (pressure) maps and manipulation videos. We also propose a novel method to learn a cross-modal latent manifold that allows for cross-modal prediction and the discovery of latent structure across data modalities. We conduct extensive experiments to demonstrate the effectiveness of our method.
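
The abstract describes cross-modal prediction through a shared latent manifold only at a high level. The sketch below is a minimal illustration of that general idea, not the paper's actual architecture: it pairs a per-modality encoder and decoder with a shared latent space so either modality can be predicted from the other. All dimensions, the MLP encoders, and the simple reconstruction-plus-alignment objective are assumptions made for exposition.

import torch
import torch.nn as nn

# Illustrative only: a minimal shared-latent-space model for paired
# visual and tactile signals. The layer sizes, the use of plain MLPs,
# and the training objective are assumptions, not the method proposed
# in the paper.

class CrossModalAutoencoder(nn.Module):
    def __init__(self, vis_dim=512, tac_dim=1024, latent_dim=128):
        super().__init__()
        # One encoder and one decoder per modality, meeting in a shared latent space.
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.tac_enc = nn.Sequential(nn.Linear(tac_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.vis_dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, vis_dim))
        self.tac_dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, tac_dim))

    def forward(self, vis, tac):
        z_v, z_t = self.vis_enc(vis), self.tac_enc(tac)
        # Decode each latent with both decoders so either modality can be
        # predicted from the other at test time.
        return {
            "vis_from_vis": self.vis_dec(z_v), "vis_from_tac": self.vis_dec(z_t),
            "tac_from_tac": self.tac_dec(z_t), "tac_from_vis": self.tac_dec(z_v),
            "z_vis": z_v, "z_tac": z_t,
        }


def loss_fn(out, vis, tac):
    mse = nn.functional.mse_loss
    # Reconstruct each modality from its own latent and from the other modality's latent.
    recon = (mse(out["vis_from_vis"], vis) + mse(out["tac_from_tac"], tac)
             + mse(out["vis_from_tac"], vis) + mse(out["tac_from_vis"], tac))
    # Align the two latents so paired samples map to nearby points on the shared manifold.
    align = mse(out["z_vis"], out["z_tac"])
    return recon + align

Trained on paired samples, such a model can be queried with only one modality at test time (for example, predicting a full-hand pressure map from video features, or vice versa) by encoding the available modality and decoding with the other decoder.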

Cite

Text

Li et al. "Learning to Jointly Understand Visual and Tactile Signals." International Conference on Learning Representations, 2024.

Markdown

[Li et al. "Learning to Jointly Understand Visual and Tactile Signals." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/li2024iclr-learning/)

BibTeX

@inproceedings{li2024iclr-learning,
  title     = {{Learning to Jointly Understand Visual and Tactile Signals}},
  author    = {Li, Yichen and Du, Yilun and Liu, Chao and Williams, Francis and Foshey, Michael and Eckart, Benjamin and Kautz, Jan and Tenenbaum, Joshua B. and Torralba, Antonio and Matusik, Wojciech},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/li2024iclr-learning/}
}