Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Abstract

Change captioning is the task of describing the difference between images in natural language. Most existing methods treat this problem as a difference judgment and assume that distractors such as viewpoint changes are absent. In practice, however, viewpoint changes happen often and can overwhelm the semantic difference to be described. In this paper, we propose a novel visual encoder that explicitly distinguishes viewpoint changes from semantic changes in the change captioning task. We further simulate the attention preference of humans and propose a novel reinforcement learning process that fine-tunes the attention directly with language evaluation rewards. Extensive experimental results show that our method outperforms state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.

Cite

Text

Shi et al. "Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58568-6_34

Markdown

[Shi et al. "Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/shi2020eccv-finding/) doi:10.1007/978-3-030-58568-6_34

BibTeX

@inproceedings{shi2020eccv-finding,
  title     = {{Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning}},
  author    = {Shi, Xiangxi and Yang, Xu and Gu, Jiuxiang and Joty, Shafiq and Cai, Jianfei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58568-6_34},
  url       = {https://mlanthology.org/eccv/2020/shi2020eccv-finding/}
}