Generating Natural-Language Video Descriptions Using Text-Mined Knowledge

Abstract

We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it with contextual information, and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
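The core idea the abstract describes, combining vision-detector confidences with a text-mined language prior to pick the most probable subject-verb-object triplet, can be illustrated with a minimal sketch. This is not the authors' code; the function name, the linear interpolation weight `alpha`, and all confidences and prior values below are illustrative assumptions.

```python
# Hedged sketch (not the paper's implementation): score every candidate
# (subject, verb, object) triplet by interpolating visual detector
# confidence with a text-mined co-occurrence prior, then take the argmax.
from itertools import product

def best_svo(subject_scores, verb_scores, object_scores, lang_prior, alpha=0.5):
    """Return the triplet maximizing a weighted mix of vision and language evidence.

    subject_scores / verb_scores / object_scores: dicts mapping labels to
        detector confidences in [0, 1] (hypothetical values).
    lang_prior: dict mapping (s, v, o) tuples to probabilities mined from
        text corpora; missing triplets default to a small smoothing constant.
    alpha: assumed interpolation weight between the two evidence sources.
    """
    best, best_score = None, float("-inf")
    for s, v, o in product(subject_scores, verb_scores, object_scores):
        vision = subject_scores[s] * verb_scores[v] * object_scores[o]
        prior = lang_prior.get((s, v, o), 1e-6)  # smoothing for unseen triplets
        score = alpha * vision + (1 - alpha) * prior
        if score > best_score:
            best, best_score = (s, v, o), score
    return best

# Toy example with made-up confidences and priors: the language prior
# pulls the choice toward the plausible "person ride bicycle" even though
# "walk" has the higher raw verb confidence.
subjects = {"person": 0.9, "dog": 0.4}
verbs = {"ride": 0.5, "walk": 0.6}
objects = {"bicycle": 0.8, "car": 0.3}
priors = {("person", "ride", "bicycle"): 0.7}
print(best_svo(subjects, verbs, objects, priors))  # ('person', 'ride', 'bicycle')
```

The selected triplet would then be rendered into a sentence; how the paper performs that surface realization is not covered by this sketch.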

Cite

Text

Krishnamoorthy et al. "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge." AAAI Conference on Artificial Intelligence, 2013. doi:10.1609/AAAI.V27I1.8679

Markdown

[Krishnamoorthy et al. "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge." AAAI Conference on Artificial Intelligence, 2013.](https://mlanthology.org/aaai/2013/krishnamoorthy2013aaai-generating/) doi:10.1609/AAAI.V27I1.8679

BibTeX

@inproceedings{krishnamoorthy2013aaai-generating,
  title     = {{Generating Natural-Language Video Descriptions Using Text-Mined Knowledge}},
  author    = {Krishnamoorthy, Niveda and Malkarnenkar, Girish and Mooney, Raymond J. and Saenko, Kate and Guadarrama, Sergio},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2013},
  pages     = {541--547},
  doi       = {10.1609/AAAI.V27I1.8679},
  url       = {https://mlanthology.org/aaai/2013/krishnamoorthy2013aaai-generating/}
}