Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition

Abstract

In this paper, we propose a multimodal multi-stream deep learning framework to tackle the egocentric activity recognition problem using both video and sensor data. First, we experiment with and extend a multi-stream Convolutional Neural Network to learn spatial and temporal features from egocentric videos. Second, we propose a multi-stream Long Short-Term Memory architecture to learn features from multiple sensor streams (accelerometer, gyroscope, etc.). Third, we propose a two-level fusion technique and experiment with different pooling techniques to compute the prediction results. Experimental results on a multimodal egocentric dataset show that our proposed method achieves very encouraging performance, despite the constraint that the scale of existing egocentric datasets is still quite limited.
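To make the sensor branch and the two-level fusion concrete, here is a minimal sketch, not the authors' code: it assumes PyTorch, illustrative layer sizes, and made-up inputs (a 3-axis accelerometer stream, a 3-axis gyroscope stream, and a stand-in video-branch score vector). One LSTM per sensor stream produces per-stream class scores, which are pooled at the first level; the pooled sensor scores are then fused with video scores at the second level by weighted average pooling.

```python
import torch
import torch.nn as nn


class MultiStreamLSTM(nn.Module):
    """One LSTM per sensor stream; per-stream class scores are
    average-pooled (level-1 fusion). Sizes are illustrative."""

    def __init__(self, stream_channels, hidden_size=64, num_classes=20):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(c, hidden_size, batch_first=True) for c in stream_channels
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, num_classes) for _ in stream_channels
        )

    def forward(self, streams):
        scores = []
        for x, lstm, head in zip(streams, self.lstms, self.heads):
            out, _ = lstm(x)                 # (batch, time, hidden)
            scores.append(head(out[:, -1]))  # last time step -> class scores
        # Level-1 fusion: average the per-stream class scores.
        return torch.stack(scores).mean(dim=0)


# Toy usage with two hypothetical sensor streams.
model = MultiStreamLSTM(stream_channels=[3, 3])
acc = torch.randn(8, 100, 3)    # (batch, time steps, channels)
gyro = torch.randn(8, 100, 3)
sensor_scores = model([acc, gyro])

# Level-2 fusion: combine with video-branch scores (random stand-in here,
# in place of the multi-stream CNN's output).
video_scores = torch.randn(8, 20)
final = 0.5 * sensor_scores + 0.5 * video_scores  # weighted average pooling
prediction = final.argmax(dim=1)
```

The per-stream LSTMs and the fusion weights here are placeholders; the paper itself compares several pooling choices at both fusion levels.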

Cite

Text

Song et al. "Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2016. doi:10.1109/CVPRW.2016.54

Markdown

[Song et al. "Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2016.](https://mlanthology.org/cvprw/2016/song2016cvprw-multimodal/) doi:10.1109/CVPRW.2016.54

BibTeX

@inproceedings{song2016cvprw-multimodal,
  title     = {{Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition}},
  author    = {Song, Sibo and Chandrasekhar, Vijay and Mandal, Bappaditya and Li, Liyuan and Lim, Joo-Hwee and Babu, Giduthuri Sateesh and San, Phyo Phyo and Cheung, Ngai-Man},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2016},
  pages     = {378--385},
  doi       = {10.1109/CVPRW.2016.54},
  url       = {https://mlanthology.org/cvprw/2016/song2016cvprw-multimodal/}
}