Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

Abstract

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic labels of the objects using paired captions when available. At inference time the model only requires RGB frames as input, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the quality of the object-aware representations learnt by our model by: (i) evaluating the model for strong transfer, i.e., zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) evaluating its temporal and spatial (grounding) performance by fine-tuning for this task. In all cases the performance improves over the state of the art, even over networks trained with far larger batch sizes. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model and improve performance.
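
To make the training setup described above concrete, the following is a minimal PyTorch-style sketch, not the authors' code: a decoder with learned queries attends over video tokens and is supervised with auxiliary hand/object box and caption-noun label losses alongside a standard video-text contrastive loss. All module names, query counts, and loss weights here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareDecoder(nn.Module):
    """Hypothetical object-aware decoder head (illustrative only)."""
    def __init__(self, dim=256, num_queries=8, num_classes=300):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)              # hand/object boxes (cx, cy, w, h)
        self.label_head = nn.Linear(dim, num_classes)  # noun label taken from the paired caption

    def forward(self, video_tokens):                   # video_tokens: (B, T*N, dim)
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        x = self.decoder(q, video_tokens)              # queries attend over frame tokens
        return self.box_head(x).sigmoid(), self.label_head(x)

def training_losses(video_emb, text_emb, boxes_pred, boxes_gt, logits, labels_gt, temp=0.07):
    """Video-text contrastive loss plus auxiliary grounding losses (sketch only)."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = video_emb @ text_emb.t() / temp
    target = torch.arange(sim.size(0), device=sim.device)
    l_contrastive = 0.5 * (F.cross_entropy(sim, target) + F.cross_entropy(sim.t(), target))
    l_box = F.l1_loss(boxes_pred, boxes_gt)            # in practice only valid/visible boxes contribute
    l_label = F.cross_entropy(logits.flatten(0, 1), labels_gt.flatten())
    return l_contrastive + l_box + l_label

At inference time only the RGB video tokens are needed; the query outputs can still be decoded into hand and object boxes, which is consistent with the abstract's claim that the model can track and ground objects without being trained explicitly for that task.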

Cite

Text

Zhang et al. "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01278

Markdown

[Zhang et al. "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/zhang2023iccv-helping/) doi:10.1109/ICCV51070.2023.01278

BibTeX

@inproceedings{zhang2023iccv-helping,
  title     = {{Helping Hands: An Object-Aware Ego-Centric Video Recognition Model}},
  author    = {Zhang, Chuhan and Gupta, Ankush and Zisserman, Andrew},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13901--13912},
  doi       = {10.1109/ICCV51070.2023.01278},
  url       = {https://mlanthology.org/iccv/2023/zhang2023iccv-helping/}
}