COHESIV: Contrastive Object and Hand Embedding Segmentation in Video

Abstract

In this paper, we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and a hand location as input and segments the hand and the hand-held object. For learning, we generate responsibility maps that measure how well a hand's motion explains the motion of other pixels in a video. We use these responsibility maps as pseudo-labels to train a weakly-supervised neural network with an attention-based similarity loss and a contrastive loss. Our system outperforms alternative methods, achieving strong performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
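The abstract describes training per-pixel embeddings with a contrastive loss supervised by responsibility-map pseudo-labels. The sketch below illustrates one common form such a pixel-level contrastive objective can take (a pull/push formulation over pseudo-label clusters); the function name, margin, and label convention are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, pseudo_labels, margin=1.0):
    """Hypothetical sketch of a pixel-level contrastive loss.

    embeddings:    (D, H, W) per-pixel embedding map from the network.
    pseudo_labels: (H, W) integer map derived from responsibility maps,
                   e.g. 0 = background, 1 = hand, 2 = held object.
    """
    D = embeddings.shape[0]
    emb = embeddings.reshape(D, -1).t()      # (H*W, D) pixel embeddings
    labels = pseudo_labels.reshape(-1)       # (H*W,) pseudo-labels

    means, pull = [], 0.0
    for c in labels.unique():
        cluster = emb[labels == c]           # pixels sharing a pseudo-label
        mu = cluster.mean(dim=0)
        means.append(mu)
        # Pull term: draw each pixel embedding toward its cluster mean.
        pull = pull + ((cluster - mu).norm(dim=1) ** 2).mean()

    # Push term: keep different cluster means at least `margin` apart.
    push = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            dist = (means[i] - means[j]).norm()
            push = push + F.relu(margin - dist) ** 2

    return pull + push
```

At inference, pixels whose embeddings fall near the hand or object cluster would be assigned to the corresponding segment; the hard pseudo-label clusters used here stand in for the soft responsibility maps the paper derives from motion.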

Cite

Text

Shan et al. "COHESIV: Contrastive Object and Hand Embedding Segmentation in Video." Neural Information Processing Systems, 2021.

Markdown

[Shan et al. "COHESIV: Contrastive Object and Hand Embedding Segmentation in Video." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/shan2021neurips-cohesiv/)

BibTeX

@inproceedings{shan2021neurips-cohesiv,
  title     = {{COHESIV: Contrastive Object and Hand Embedding Segmentation in Video}},
  author    = {Shan, Dandan and Higgins, Richard and Fouhey, David},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/shan2021neurips-cohesiv/}
}