Adaptive Feature Abstraction for Translating Video to Language

Abstract

A new model for video captioning is developed, using a deep three-dimensional Convolutional Neural Network (C3D) as an encoder for videos and a Recurrent Neural Network (RNN) as a decoder for captions. We consider both "hard" and "soft" attention mechanisms to adaptively and sequentially focus on different layers of features (levels of feature "abstraction"), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. In addition to visualizations of the results and of how the model operates, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.

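To make the idea in the abstract concrete, below is a minimal PyTorch sketch of a soft-attention decoder that, at each word-generation step, weights (i) spatiotemporal regions within each C3D layer's feature map and (ii) the layers themselves (levels of abstraction), conditioned on the decoder state. This is an illustrative assumption of how such a mechanism can be wired, not the authors' implementation; names such as LayerwiseSoftAttention, the shared projection dimension, and the GRU decoder are hypothetical, and the paper's "hard" (sampling-based) attention variant is not shown.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerwiseSoftAttention(nn.Module):
        """Soft attention over regions within each layer, then over layers."""
        def __init__(self, feat_dims, hidden_dim, attn_dim=256):
            super().__init__()
            # Project every layer's features into a shared space so they can be compared.
            self.proj = nn.ModuleList([nn.Linear(d, attn_dim) for d in feat_dims])
            self.query = nn.Linear(hidden_dim, attn_dim)
            self.region_score = nn.Linear(attn_dim, 1)
            self.layer_score = nn.Linear(attn_dim, 1)

        def forward(self, layer_feats, h):
            # layer_feats: list of tensors, each (batch, num_regions_l, feat_dim_l)
            # h: decoder hidden state, (batch, hidden_dim)
            q = self.query(h).unsqueeze(1)                      # (batch, 1, attn_dim)
            layer_ctx = []
            for feats, proj in zip(layer_feats, self.proj):
                p = proj(feats)                                 # (batch, R_l, attn_dim)
                scores = self.region_score(torch.tanh(p + q))   # (batch, R_l, 1)
                alpha = F.softmax(scores, dim=1)                # weights over regions
                layer_ctx.append((alpha * p).sum(dim=1))        # per-layer context
            ctx = torch.stack(layer_ctx, dim=1)                 # (batch, L, attn_dim)
            beta = F.softmax(self.layer_score(torch.tanh(ctx + q)), dim=1)  # over layers
            return (beta * ctx).sum(dim=1)                      # (batch, attn_dim)

    class CaptionDecoder(nn.Module):
        """GRU decoder that consumes the attended video context at every step."""
        def __init__(self, vocab_size, feat_dims, embed_dim=256, hidden_dim=512, attn_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.attn = LayerwiseSoftAttention(feat_dims, hidden_dim, attn_dim)
            self.gru = nn.GRUCell(embed_dim + attn_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, layer_feats, captions):
            batch, steps = captions.shape
            h = layer_feats[0].new_zeros(batch, self.gru.hidden_size)
            logits = []
            for t in range(steps):
                ctx = self.attn(layer_feats, h)                 # adaptive feature abstraction
                x = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
                h = self.gru(x, h)
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                   # (batch, steps, vocab)

    # Toy usage: two C3D layers with different spatiotemporal resolutions and channel counts.
    feats = [torch.randn(2, 4 * 7 * 7, 512), torch.randn(2, 2 * 4 * 4, 1024)]
    decoder = CaptionDecoder(vocab_size=1000, feat_dims=[512, 1024])
    print(decoder(feats, torch.randint(0, 1000, (2, 6))).shape)  # torch.Size([2, 6, 1000])

The two-stage softmax (regions within a layer, then across layers) is one simple way to realize the "adaptive" choice of abstraction level that the abstract describes; the attention weights can also be inspected to visualize where and at which layer the model is looking while generating each word.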
Cite

Text

Pu et al. "Adaptive Feature Abstraction for Translating Video to Language." International Conference on Learning Representations, 2017.

Markdown

[Pu et al. "Adaptive Feature Abstraction for Translating Video to Language." International Conference on Learning Representations, 2017.](https://mlanthology.org/iclr/2017/pu2017iclr-adaptive/)

BibTeX

@inproceedings{pu2017iclr-adaptive,
  title     = {{Adaptive Feature Abstraction for Translating Video to Language}},
  author    = {Pu, Yunchen and Min, Martin Renqiang and Gan, Zhe and Carin, Lawrence},
  booktitle = {International Conference on Learning Representations},
  year      = {2017},
  url       = {https://mlanthology.org/iclr/2017/pu2017iclr-adaptive/}
}