Aligning Touch, Vision, and Language for Multimodal Perception

Abstract

Touch, a crucial human sensing modality, has been absent from multimodal generative language models because tactile data is difficult to label at scale. This work addresses that gap by leveraging the simultaneous collection of tactile and visual data, which allows GPT-4V to generate tactile pseudo-labels from the corresponding visual observations alone. The resulting dataset comprises 44K vision-touch pairs with English labels (10% human-annotated, 90% GPT-4V pseudo-labels). A touch-vision-language (TVL) model trained on this dataset shows improved tactile-vision-language alignment (+29% classification accuracy) over existing models and outperforms GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark.
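The sketch below illustrates the kind of tri-modal alignment the abstract describes: a touch encoder trained contrastively against a frozen, shared vision-language embedding space (in the spirit of CLIP/ImageBind-style alignment). It is a minimal, hypothetical example, not the authors' implementation; the TouchEncoder architecture, the info_nce loss, the embedding dimension, and all names here are assumptions for illustration only, and the paper should be consulted for the actual TVL training recipe.

# Minimal sketch (assumptions: CLIP-style shared embedding space, symmetric
# InfoNCE loss, frozen vision/text embeddings). Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TouchEncoder(nn.Module):
    """Hypothetical encoder mapping a tactile frame into the shared embedding space."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Stand-in for a real ViT/CNN tactile backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, tactile: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.backbone(tactile), dim=-1)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of unit-norm embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def training_step(touch_encoder, tactile, vision_emb, text_emb):
    """Align touch embeddings with frozen vision and text embeddings of the same scene."""
    touch_emb = touch_encoder(tactile)
    return info_nce(touch_emb, vision_emb) + info_nce(touch_emb, text_emb)


if __name__ == "__main__":
    enc = TouchEncoder()
    tactile = torch.randn(8, 3, 224, 224)                   # batch of tactile frames
    vision_emb = F.normalize(torch.randn(8, 512), dim=-1)   # frozen image embeddings
    text_emb = F.normalize(torch.randn(8, 512), dim=-1)     # frozen (pseudo-)label embeddings
    print(training_step(enc, tactile, vision_emb, text_emb).item())

In this setup the text embeddings can come from either human annotations or GPT-4V pseudo-labels, which is what makes the 90% pseudo-labeled portion of the dataset usable for alignment training.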

Cite

Text

Fu et al. "Aligning Touch, Vision, and Language for Multimodal Perception." NeurIPS 2024 Workshops: WTP, 2024.

Markdown

[Fu et al. "Aligning Touch, Vision, and Language for Multimodal Perception." NeurIPS 2024 Workshops: WTP, 2024.](https://mlanthology.org/neuripsw/2024/fu2024neuripsw-aligning/)

BibTeX

@inproceedings{fu2024neuripsw-aligning,
  title     = {{Aligning Touch, Vision, and Language for Multimodal Perception}},
  author    = {Fu, Letian and Datta, Gaurav and Huang, Huang and Panitch, William Chung-Ho and Drake, Jaimyn and Ortiz, Joseph and Mukadam, Mustafa and Lambeta, Mike and Calandra, Roberto and Goldberg, Ken},
  booktitle = {NeurIPS 2024 Workshops: WTP},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/fu2024neuripsw-aligning/}
}