What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions

Abstract

Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our "muscly-supervised" representation that encodes interaction and attention cues outperforms a state-of-the-art visual-only method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics) and walkable surface estimation (affordance). Our code and dataset are available at: https://github.com/ehsanik/muscleTorch.
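To make the idea of "muscly" supervision concrete, below is a minimal sketch of the general setup the abstract describes: a visual backbone trained to predict recorded human interaction cues (body-part movement) and attention cues (gaze) from frames, so the learned features can later be transferred to downstream tasks. This is not the authors' implementation; all module names, dimensions, and losses are illustrative assumptions, not taken from the muscleTorch codebase.

```python
# Illustrative sketch only (not the authors' code): supervise a visual
# encoder with human interaction/attention signals so its features encode
# "muscly" cues. Shapes, heads, and losses are assumptions for exposition.
import torch
import torch.nn as nn
import torchvision.models as models


class MuscleSupervisedEncoder(nn.Module):
    def __init__(self, num_body_parts=10, feat_dim=512):
        super().__init__()
        # Shared visual backbone whose features we ultimately want to reuse
        # for downstream tasks (scene classification, depth, etc.).
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # (B, 512, 1, 1)

        # Head predicting per-body-part movement (e.g., 3D displacement).
        self.movement_head = nn.Linear(feat_dim, num_body_parts * 3)
        # Head predicting a gaze location on the image plane (x, y).
        self.gaze_head = nn.Linear(feat_dim, 2)

    def forward(self, frames):
        feats = self.encoder(frames).flatten(1)   # (B, 512)
        movement = self.movement_head(feats)      # (B, num_body_parts * 3)
        gaze = self.gaze_head(feats)              # (B, 2)
        return feats, movement, gaze


def training_step(model, frames, movement_gt, gaze_gt):
    # Regress the recorded interaction/attention signals from visual features;
    # after training, the encoder is transferred to the target tasks.
    _, movement_pred, gaze_pred = model(frames)
    loss = nn.functional.mse_loss(movement_pred, movement_gt) \
         + nn.functional.mse_loss(gaze_pred, gaze_gt)
    return loss
```

The key design point suggested by the abstract is that the supervision signal comes from the human (muscle movement and gaze) rather than from the pixels alone; the specific heads and loss weighting used in the paper may differ and are documented in the linked repository.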

Cite

Text

Ehsani et al. "What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions." International Conference on Learning Representations, 2021.

Markdown

[Ehsani et al. "What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/ehsani2021iclr-you/)

BibTeX

@inproceedings{ehsani2021iclr-you,
  title     = {{What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions}},
  author    = {Ehsani, Kiana and Gordon, Daniel and Nguyen, Thomas Hai Dang and Mottaghi, Roozbeh and Farhadi, Ali},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/ehsani2021iclr-you/}
}