STAViS: Spatio-Temporal AudioVisual Saliency Network

Abstract

We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features and learns to appropriately localize sound sources and to fuse the two saliencies to obtain a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data for a large variety of videos. We compare our method against eight different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms our visual-only variant as well as the other state-of-the-art models in the majority of cases. Also, the consistently good performance it achieves on all databases indicates that it is appropriate for estimating saliency "in-the-wild". The code is available at https://github.com/atsiami/STAViS.

Cite

Text

Tsiami et al. "STAViS: Spatio-Temporal AudioVisual Saliency Network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.00482

Markdown

[Tsiami et al. "STAViS: Spatio-Temporal AudioVisual Saliency Network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/tsiami2020cvpr-stavis/) doi:10.1109/CVPR42600.2020.00482

BibTeX

@inproceedings{tsiami2020cvpr-stavis,
  title     = {{STAViS: Spatio-Temporal AudioVisual Saliency Network}},
  author    = {Tsiami, Antigoni and Koutras, Petros and Maragos, Petros},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.00482},
  url       = {https://mlanthology.org/cvpr/2020/tsiami2020cvpr-stavis/}
}