Deep Learning Based Multi-Modal Addressee Recognition in Visual Scenes with Utterances
Abstract
With the widespread use of intelligent systems such as smart speakers, addressee recognition has become a growing concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and in hospitals. Because previous studies typically focused only on pre-specified tasks with limited conversational situations, such as controlling smart homes, we created a mock dataset called Addressee Recognition in Visual Scenes with Utterances (ARVSU), which contains a vast body of image variations in visual scenes, with an annotated utterance and a corresponding addressee for each scenario. We also propose a multi-modal deep-learning model that takes different human cues, specifically eye gaze and utterance transcripts, into account to predict the conversational addressee from a specific speaker's view in various real-life conversational scenarios. To the best of our knowledge, we are the first to introduce an end-to-end deep learning model that combines vision and utterance transcripts for addressee recognition. Our results suggest that future addressee-recognition systems can reach the ability to understand human intention in many social situations previously unexplored, and our multi-modal dataset is a first step toward promoting research in this field.
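The paper describes an end-to-end model that fuses visual cues (e.g., eye gaze) with utterance transcripts, but the exact architecture is not reproduced on this page. The following is only a minimal PyTorch sketch of that kind of late fusion: all layer sizes, the vocabulary size, and the number of addressee classes are illustrative assumptions, and the gaze cue is folded into the image branch for brevity.

```python
# A minimal sketch (not the authors' exact architecture) of a multi-modal
# addressee classifier: a small CNN encodes the speaker-view image, a GRU
# encodes the utterance transcript, and the concatenated features are
# classified into addressee categories. All dimensions are assumptions.
import torch
import torch.nn as nn

class AddresseeClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128,
                 text_hidden=256, num_addressees=4):
        super().__init__()
        # Visual branch: tiny CNN standing in for a pretrained backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # -> (B, 64, 1, 1)
        )
        # Text branch: token embedding + GRU over the transcript.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, text_hidden, batch_first=True)
        # Fusion: concatenate visual and text features, then classify.
        self.head = nn.Linear(64 + text_hidden, num_addressees)

    def forward(self, image, tokens):
        v = self.cnn(image).flatten(1)          # (B, 64)
        _, h = self.gru(self.embed(tokens))     # h: (1, B, text_hidden)
        return self.head(torch.cat([v, h[-1]], dim=-1))

# Smoke test with dummy inputs.
model = AddresseeClassifier()
image = torch.randn(2, 3, 224, 224)            # speaker-view frames
tokens = torch.randint(0, 5000, (2, 12))       # utterance token ids
print(model(image, tokens).shape)              # torch.Size([2, 4])
```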
Cite
Text
Le Minh et al. "Deep Learning Based Multi-Modal Addressee Recognition in Visual Scenes with Utterances." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/214

Markdown
[Le Minh et al. "Deep Learning Based Multi-Modal Addressee Recognition in Visual Scenes with Utterances." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/minh2018ijcai-deep/) doi:10.24963/IJCAI.2018/214

BibTeX
@inproceedings{minh2018ijcai-deep,
title = {{Deep Learning Based Multi-Modal Addressee Recognition in Visual Scenes with Utterances}},
author = {Le Minh, Thao and Shimizu, Nobuyuki and Miyazaki, Takashi and Shinoda, Koichi},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2018},
pages = {1546--1553},
doi = {10.24963/IJCAI.2018/214},
url = {https://mlanthology.org/ijcai/2018/minh2018ijcai-deep/}
}