COHESIV: Contrastive Object and Hand Embedding Segmentation in Video
Abstract
In this paper, we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and a hand location as input and segments the hand and the hand-held object. For training, we generate responsibility maps that show how well a hand's motion explains the motion of other pixels in video. We use these responsibility maps as pseudo-labels to train a weakly supervised neural network with an attention-based similarity loss and a contrastive loss. Our system outperforms alternative methods, achieving good performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
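The sketch below illustrates the two ingredients the abstract names, under loudly labeled assumptions: the paper's exact formulation may differ. Here a pixel's "responsibility" is modeled as how well the hand's mean optical flow explains that pixel's flow (an assumed Gaussian kernel), and a simple threshold-based contrastive loss stands in for the paper's attention-based similarity and contrastive losses. The function names `responsibility_map` and `contrastive_pixel_loss` and all parameters (`sigma`, `tau`, `margin`) are hypothetical.

```python
# Illustrative sketch only; the forms of the responsibility model and
# loss are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def responsibility_map(flow, hand_mask, sigma=2.0):
    """flow: (2, H, W) optical flow; hand_mask: (H, W) in {0, 1}.
    Returns an (H, W) soft map in [0, 1]: how well the hand's mean
    motion explains each pixel's motion (assumed Gaussian form)."""
    hand_flow = flow[:, hand_mask.bool()].mean(dim=1)           # (2,)
    err = ((flow - hand_flow[:, None, None]) ** 2).sum(dim=0)   # (H, W)
    return torch.exp(-err / (2 * sigma ** 2))

def contrastive_pixel_loss(emb, hand_emb, resp, tau=0.5, margin=1.0):
    """emb: (D, H, W) per-pixel embeddings; hand_emb: (D,) embedding
    queried at the hand location; resp: (H, W) pseudo-label.
    Pixels with resp > tau are treated as positives (pulled toward
    hand_emb); the rest are pushed past a margin."""
    d = ((emb - hand_emb[:, None, None]) ** 2).sum(dim=0).sqrt()  # (H, W)
    pos = resp > tau
    pull = (d[pos] ** 2).mean() if pos.any() else d.new_zeros(())
    push = (F.relu(margin - d[~pos]) ** 2).mean() if (~pos).any() else d.new_zeros(())
    return pull + push

# Toy usage on random tensors, just to show the shapes involved.
H, W, D = 32, 32, 8
flow = torch.randn(2, H, W)
hand_mask = torch.zeros(H, W)
hand_mask[10:16, 10:16] = 1
resp = responsibility_map(flow, hand_mask)
emb = torch.randn(D, H, W, requires_grad=True)
hand_emb = emb[:, 12, 12].detach()
loss = contrastive_pixel_loss(emb, hand_emb, resp)
loss.backward()
```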
Cite
Text
Shan et al. "COHESIV: Contrastive Object and Hand Embedding Segmentation in Video." Neural Information Processing Systems, 2021.
Markdown
[Shan et al. "COHESIV: Contrastive Object and Hand Embedding Segmentation in Video." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/shan2021neurips-cohesiv/)
BibTeX
@inproceedings{shan2021neurips-cohesiv,
  title     = {{COHESIV: Contrastive Object and Hand Embedding Segmentation in Video}},
  author    = {Shan, Dandan and Higgins, Richard and Fouhey, David},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/shan2021neurips-cohesiv/}
}