Self-Supervised Cross-View Representation Reconstruction for Change Captioning

Abstract

Change captioning aims to describe the difference between a pair of similar images. Its key challenge is learning a stable difference representation despite pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing the cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning module to improve caption quality. This module models a "hallucination" representation in reverse, from the caption and the "before" representation. By pushing it closer to the "after" representation, we encourage the caption to be informative about the difference in a self-supervised manner. Extensive experiments show that our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.
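
The cross-view contrastive alignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the SCORER implementation: the function names, the max-over-tokens cosine score, the split of the feature dimension into heads, the symmetric InfoNCE-style loss, and the temperature value are all assumptions chosen for illustration. The sketch only shows the general idea that the "before" and "after" images of the same pair should score higher under token-wise matching than images from different pairs.

```python
import numpy as np


def token_wise_similarity(a, b, num_heads=4):
    """Hypothetical multi-head token-wise matching score between two images.

    a: (n, d) token features of one image, b: (m, d) of the other.
    The feature dimension is split into `num_heads` heads (an assumption).
    In each head, every token is matched to its most similar token in the
    other image by cosine similarity, and the matches are averaged.
    """
    n, d = a.shape
    m, _ = b.shape
    dh = d // num_heads
    # Reshape to (heads, tokens, head_dim).
    ah = a.reshape(n, num_heads, dh).transpose(1, 0, 2)
    bh = b.reshape(m, num_heads, dh).transpose(1, 0, 2)
    ah = ah / np.linalg.norm(ah, axis=-1, keepdims=True)
    bh = bh / np.linalg.norm(bh, axis=-1, keepdims=True)
    sim = np.einsum("hnd,hmd->hnm", ah, bh)  # per-head cosine similarities
    a_to_b = sim.max(axis=2).mean(axis=1)    # best match for each token of a
    b_to_a = sim.max(axis=1).mean(axis=1)    # best match for each token of b
    return float((0.5 * (a_to_b + b_to_a)).mean())


def contrastive_alignment_loss(before, after, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of image pairs (assumed form).

    before, after: (B, n, d). Pair i (before[i], after[i]) is the positive;
    all other combinations in the batch act as negatives.
    """
    batch = before.shape[0]
    logits = np.array(
        [[token_wise_similarity(before[i], after[j]) for j in range(batch)]
         for i in range(batch)]
    ) / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    before = rng.standard_normal((4, 5, 32))
    # Synthetic "after" views: the same tokens with a small perturbation.
    after = before + 0.01 * rng.standard_normal(before.shape)
    print("aligned pairs:   ", contrastive_alignment_loss(before, after))
    print("mismatched pairs:", contrastive_alignment_loss(before, np.roll(after, 1, axis=0)))
```

On the synthetic data above, correctly paired images give a lower loss than mismatched ones, which is the behavior a contrastive alignment objective is meant to reward. The released code at the repository linked in the abstract is the reference for how the authors actually compute the matching and the loss.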

Cite

Text

Tu et al. "Self-Supervised Cross-View Representation Reconstruction for Change Captioning." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00263

Markdown

[Tu et al. "Self-Supervised Cross-View Representation Reconstruction for Change Captioning." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/tu2023iccv-selfsupervised/) doi:10.1109/ICCV51070.2023.00263

BibTeX

@inproceedings{tu2023iccv-selfsupervised,
  title     = {{Self-Supervised Cross-View Representation Reconstruction for Change Captioning}},
  author    = {Tu, Yunbin and Li, Liang and Su, Li and Zha, Zheng-Jun and Yan, Chenggang and Huang, Qingming},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {2805-2815},
  doi       = {10.1109/ICCV51070.2023.00263},
  url       = {https://mlanthology.org/iccv/2023/tu2023iccv-selfsupervised/}
}