Seeing and Hearing Egocentric Actions: How Much Can We Learn?
Abstract
Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
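The abstract describes the architecture only at a high level. The following is a minimal sketch of how such a model could be wired up, assuming TSN-style sparse snippet sampling and a learned late-fusion layer over per-stream class scores; the class name LateFusionActionModel, the score-averaging scheme, and the fusion layer are illustrative assumptions, not the authors' implementation.

# Minimal sketch, not the authors' code: three per-modality backbones
# (RGB, optical flow, audio) each produce class scores for a few sparsely
# sampled snippets; scores are averaged per stream and fused late.
import torch
import torch.nn as nn

class LateFusionActionModel(nn.Module):
    def __init__(self, spatial_net, temporal_net, audio_net, num_classes):
        super().__init__()
        self.spatial_net = spatial_net    # RGB frame stream (assumed backbone)
        self.temporal_net = temporal_net  # optical-flow stream (assumed backbone)
        self.audio_net = audio_net        # audio stream, e.g. spectrogram CNN (assumed)
        # simple learned late fusion over the concatenated per-stream scores
        self.fusion = nn.Linear(3 * num_classes, num_classes)

    def forward(self, rgb_snippets, flow_snippets, audio_snippets):
        # each input: (batch, num_snippets, ...); scores are averaged over snippets
        s = torch.stack([self.spatial_net(x) for x in rgb_snippets.unbind(1)]).mean(0)
        t = torch.stack([self.temporal_net(x) for x in flow_snippets.unbind(1)]).mean(0)
        a = torch.stack([self.audio_net(x) for x in audio_snippets.unbind(1)]).mean(0)
        # late fusion: concatenate per-stream class scores and map to final scores
        return self.fusion(torch.cat([s, t, a], dim=1))

In this sketch, "sparse temporal sampling" simply means each clip contributes a small, fixed number of evenly spaced snippets per stream, so the second tensor dimension holds the snippets rather than every frame of the video.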
Cite
Text
Cartas et al. "Seeing and Hearing Egocentric Actions: How Much Can We Learn?" IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00548
Markdown
[Cartas et al. "Seeing and Hearing Egocentric Actions: How Much Can We Learn?" IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/cartas2019iccvw-seeing/) doi:10.1109/ICCVW.2019.00548
BibTeX
@inproceedings{cartas2019iccvw-seeing,
title = {{Seeing and Hearing Egocentric Actions: How Much Can We Learn?}},
author = {Cartas, Alejandro and Luque, Jordi and Radeva, Petia and Segura, Carlos and Dimiccoli, Mariella},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2019},
pages = {4470-4480},
doi = {10.1109/ICCVW.2019.00548},
url = {https://mlanthology.org/iccvw/2019/cartas2019iccvw-seeing/}
}