Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking
Abstract
We investigate the importance of human-centered visual cues for predicting the popularity of a public lecture. We construct a large database of more than 1800 TED talk videos and leverage the corresponding online viewers' ratings from YouTube as a measure of each talk's popularity. Visual cues related to facial and physical appearance, facial expressions, and pose variations are learned using convolutional neural networks (CNNs) connected to an attention-based long short-term memory (LSTM) network that predicts video popularity. The proposed overall network is end-to-end trainable and achieves state-of-the-art prediction accuracy, indicating that visual cues alone contain highly predictive information about a talk's popularity. We also demonstrate qualitatively that the network learns a human-like attention mechanism, which is particularly useful for interpretability, i.e., how attention varies over time and across different visual cues as a function of their relative importance.
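For concreteness, below is a minimal sketch of a multichannel attention LSTM of the kind the abstract describes: per-cue CNN features are fused by channel attention, modeled over time by an LSTM, and pooled by temporal attention. The layer names, dimensions, rating head, and the assumption of precomputed per-channel features are illustrative and not taken from the paper.

```python
# Illustrative sketch (not the authors' code): a multichannel attention LSTM in PyTorch.
# Input is assumed to be precomputed CNN embeddings for each visual cue
# (e.g. face appearance, expression, pose), shaped (batch, time, channels, feat_dim).
import torch
import torch.nn as nn


class MultichannelAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_ratings=1):
        super().__init__()
        # Channel attention: scores each visual cue at every time step.
        self.channel_attn = nn.Linear(feat_dim, 1)
        # Temporal model over the fused per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Temporal attention: scores each time step's LSTM output.
        self.temporal_attn = nn.Linear(hidden_dim, 1)
        self.head = nn.Linear(hidden_dim, num_ratings)

    def forward(self, x):
        # x: (batch, time, channels, feat_dim)
        ch_scores = self.channel_attn(x).squeeze(-1)          # (b, t, c)
        ch_weights = torch.softmax(ch_scores, dim=-1)         # attention over cues
        fused = (ch_weights.unsqueeze(-1) * x).sum(dim=2)     # (b, t, feat_dim)
        h, _ = self.lstm(fused)                               # (b, t, hidden_dim)
        tm_scores = self.temporal_attn(h).squeeze(-1)         # (b, t)
        tm_weights = torch.softmax(tm_scores, dim=-1)         # attention over time
        pooled = (tm_weights.unsqueeze(-1) * h).sum(dim=1)    # (b, hidden_dim)
        return self.head(pooled), ch_weights, tm_weights


# Example: 8 clips, 40 frames each, 3 visual-cue channels of 512-d features.
model = MultichannelAttentionLSTM()
scores, ch_w, tm_w = model(torch.randn(8, 40, 3, 512))
```

The returned channel and temporal attention weights are what make such a model interpretable: they show which visual cue, and which moments of the talk, the network attends to when predicting popularity.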
Cite
Text
Sharma et al. "Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018. doi:10.1109/WACV.2018.00058
Markdown
[Sharma et al. "Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018.](https://mlanthology.org/wacv/2018/sharma2018wacv-multichannel/) doi:10.1109/WACV.2018.00058
BibTeX
@inproceedings{sharma2018wacv-multichannel,
title = {{Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking}},
author = {Sharma, Rahul and Guha, Tanaya and Sharma, Gaurav},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2018},
pages = {476-484},
doi = {10.1109/WACV.2018.00058},
url = {https://mlanthology.org/wacv/2018/sharma2018wacv-multichannel/}
}