Tracking Humans Using Multi-Modal Fusion

Abstract

Human motion detection plays an important role in automated surveillance systems. However, it is challenging to detect non-rigid moving objects (e.g., humans) robustly in a cluttered environment. In this paper, we compare two approaches for detecting walking humans using multi-modal measurements: video and audio sequences. The first approach is based on the Time-Delay Neural Network (TDNN), which fuses the audio and visual data at the feature level to detect the walking human. The second approach employs a Bayesian Network (BN) to jointly model the video and audio signals; the parameters of the graphical model are estimated using the Expectation-Maximization (EM) algorithm, and the location of the target is tracked by Bayesian inference. Experiments are performed in several indoor and outdoor scenarios: in the lab, with more than one person walking, with occlusion by bushes, etc. A comparison of the performance and efficiency of the two approaches is also presented.
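To make the feature-level fusion idea concrete, here is a minimal sketch (not the authors' implementation) of how per-frame audio and visual features might be concatenated and fed through a TDNN-style layer. All dimensions, the delay window, and the weights are illustrative assumptions:

```python
import numpy as np

# Hypothetical illustration of feature-level audio-visual fusion in the
# spirit of a time-delay neural network (TDNN): features from a window of
# adjacent time steps (the "delay") share one weight matrix. Sizes and
# weights below are made up for the sketch.

rng = np.random.default_rng(0)

T = 10            # number of time steps (frames), assumed
D_VIDEO = 8       # per-frame visual feature size, assumed
D_AUDIO = 4       # per-frame audio feature size, assumed
DELAY = 3         # temporal context window of the TDNN layer, assumed

video_feats = rng.standard_normal((T, D_VIDEO))
audio_feats = rng.standard_normal((T, D_AUDIO))

# Feature-level fusion: concatenate the two modalities per time step.
fused = np.concatenate([video_feats, audio_feats], axis=1)  # shape (T, 12)

# One TDNN-style layer: slide a window of DELAY frames over the fused
# sequence and apply a shared weight matrix to each flattened window.
W = rng.standard_normal((DELAY * fused.shape[1], 5)) * 0.1
outputs = np.stack([
    np.tanh(fused[t:t + DELAY].ravel() @ W)
    for t in range(T - DELAY + 1)
])

print(outputs.shape)  # one 5-dim activation per temporal window
```

The key point the sketch illustrates is that fusion happens before the network sees the data (early/feature-level fusion), in contrast to the BN approach, which models the two signal streams jointly and fuses them probabilistically.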

Cite

Text

Zou and Bhanu. "Tracking Humans Using Multi-Modal Fusion." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2005. doi:10.1109/CVPR.2005.545

Markdown

[Zou and Bhanu. "Tracking Humans Using Multi-Modal Fusion." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2005.](https://mlanthology.org/cvpr/2005/zou2005cvpr-tracking/) doi:10.1109/CVPR.2005.545

BibTeX

@inproceedings{zou2005cvpr-tracking,
  title     = {{Tracking Humans Using Multi-Modal Fusion}},
  author    = {Zou, Xiaotao and Bhanu, Bir},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2005},
  pages     = {4},
  doi       = {10.1109/CVPR.2005.545},
  url       = {https://mlanthology.org/cvpr/2005/zou2005cvpr-tracking/}
}