A Long Short-Term Memory Convolutional Neural Network for First-Person Vision Activity Recognition

Abstract

Temporal information is the main source of discriminating characteristics for the recognition of proprioceptive activities in first-person vision (FPV). In this paper, we propose a motion representation that uses stacked spectrograms. These spectrograms are generated over temporal windows from mean grid-optical-flow vectors and the displacement vectors of the intensity centroid. The stacked representation enables us to use 2D convolutions to learn and extract global motion features. Moreover, we employ a long short-term memory (LSTM) network to encode the temporal dependency among consecutive samples recursively. Experimental results show that the proposed approach achieves state-of-the-art performance in the largest public dataset for FPV activity recognition.
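The stacked-spectrogram idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the motion signals are simulated random data standing in for the per-frame mean grid-optical-flow and intensity-centroid displacement components, and the sampling rate and spectrogram parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

# Simulated per-frame motion signals over one temporal window:
# mean grid-optical-flow (x, y) and intensity-centroid displacement (x, y).
rng = np.random.default_rng(0)
num_frames = 256  # frames per temporal window (assumed)
signals = rng.standard_normal((4, num_frames))  # 4 motion channels

# One spectrogram per motion channel; fs/nperseg/noverlap are illustrative.
specs = []
for sig in signals:
    f, t, Sxx = spectrogram(sig, fs=30.0, nperseg=64, noverlap=48)
    specs.append(np.log1p(Sxx))  # log scaling to compress dynamic range

# Stack the channel spectrograms vertically into a single 2D array, so a
# standard 2D convolutional network can learn global motion features
# jointly across channels and time, as described in the abstract.
stacked = np.vstack(specs)
print(stacked.shape)  # (4 * 33 freq bins, 13 time steps) -> (132, 13)
```

A sequence of such stacked spectrograms (one per temporal window) would then be fed to the CNN, with the LSTM modeling the dependency among consecutive windows.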

Cite

Text

Abebe and Cavallaro. "A Long Short-Term Memory Convolutional Neural Network for First-Person Vision Activity Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2017. doi:10.1109/ICCVW.2017.159

Markdown

[Abebe and Cavallaro. "A Long Short-Term Memory Convolutional Neural Network for First-Person Vision Activity Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2017.](https://mlanthology.org/iccvw/2017/abebe2017iccvw-long/) doi:10.1109/ICCVW.2017.159

BibTeX

@inproceedings{abebe2017iccvw-long,
  title     = {{A Long Short-Term Memory Convolutional Neural Network for First-Person Vision Activity Recognition}},
  author    = {Abebe, Girmaw and Cavallaro, Andrea},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2017},
  pages     = {1339--1346},
  doi       = {10.1109/ICCVW.2017.159},
  url       = {https://mlanthology.org/iccvw/2017/abebe2017iccvw-long/}
}