Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

Abstract

We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots, and future action. Specifically, we treat the future hand motion as motor attention, and model this attention using probabilistic variables in our deep model. The predicted motor attention is then used to select the discriminative spatial-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/
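
The idea of using a probabilistic attention map to select spatial-temporal features can be illustrated with a minimal sketch. This is not the network described in the paper: the function names (`softmax`, `sample_attention`, `attend`), the Gumbel-noise sampling scheme, and the toy feature values are all assumptions made for illustration. The sketch only shows how a sampled attention distribution over flattened space-time locations could weight per-location features before a downstream predictor.

```python
import math
import random


def softmax(logits):
    """Normalize a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_attention(logits, temperature=1.0, rng=None):
    """Draw a soft attention map by adding Gumbel noise to the logits.

    Assumption: this sampling scheme is one common way to treat attention
    as a random variable; the abstract does not specify the mechanism.
    """
    rng = rng or random.Random(0)
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / temperature)
    return softmax(noisy)


def attend(features, attention):
    """Return the attention-weighted sum of per-location feature vectors."""
    dim = len(features[0])
    return [
        sum(a * f[d] for a, f in zip(attention, features))
        for d in range(dim)
    ]


# Hypothetical toy input: 4 space-time locations, 3-dimensional features.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
attention_logits = [0.0, 0.0, 50.0, 0.0]  # location 2 is strongly preferred

attention = sample_attention(attention_logits)
pooled = attend(features, attention)  # would feed an action classifier
```

In this toy setup the sampled map concentrates on the location with the largest logit, so the pooled vector is dominated by that location's features. A learned model would produce the logits from video and train the whole pipeline end to end.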

Cite

Text

Liu et al. "Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58452-8_41

Markdown

[Liu et al. "Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/liu2020eccv-forecasting/) doi:10.1007/978-3-030-58452-8_41

BibTeX

@inproceedings{liu2020eccv-forecasting,
  title     = {{Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video}},
  author    = {Liu, Miao and Tang, Siyu and Li, Yin and Rehg, James M.},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58452-8_41},
  url       = {https://mlanthology.org/eccv/2020/liu2020eccv-forecasting/}
}