Hierarchical Audio-Visual Cue Integration Framework for Activity Analysis in Intelligent Meeting Rooms

Abstract

Scene understanding in a smart meeting room involves extracting cues at different levels of semantic abstraction. Human activity in the scene is usually monitored using arrays of audio and visual sensors, and the relevant tasks include person localization and tracking, speaker ID, focus of attention detection, speech recognition and affective state recognition. In this paper we demonstrate a system that extracts such information by combining the outputs of these tasks so that they support each other. We exploit the fact that the output of one human activity analysis task contains valuable information for another; interconnecting the tasks yields a robust system. We demonstrate this in a smart meeting room equipped with 3 cameras and 16 microphones. The system performs person tracking, head pose estimation, beamforming, speaker ID and speech recognition using audio and visual cues. The novelty lies in connecting the tasks so that they provide relevant information to one another. We evaluate the system and present results for tasks such as keyword spotting and tracking re-identification on real-world meeting scenes collected in our audio-visual testbed.

Cite

Text

Shivappa et al. "Hierarchical Audio-Visual Cue Integration Framework for Activity Analysis in Intelligent Meeting Rooms." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2009. doi:10.1109/CVPRW.2009.5204224

Markdown

[Shivappa et al. "Hierarchical Audio-Visual Cue Integration Framework for Activity Analysis in Intelligent Meeting Rooms." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2009.](https://mlanthology.org/cvprw/2009/shivappa2009cvprw-hierarchical/) doi:10.1109/CVPRW.2009.5204224

BibTeX

@inproceedings{shivappa2009cvprw-hierarchical,
  title     = {{Hierarchical Audio-Visual Cue Integration Framework for Activity Analysis in Intelligent Meeting Rooms}},
  author    = {Shivappa, Shankar T. and Trivedi, Mohan M. and Rao, Bhaskar D.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2009},
  pages     = {107--114},
  doi       = {10.1109/CVPRW.2009.5204224},
  url       = {https://mlanthology.org/cvprw/2009/shivappa2009cvprw-hierarchical/}
}