Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering

Abstract

Few-shot Visual Question Answering (VQA) brings few-shot learning to a cross-modal setting and is an emerging and challenging task in computer vision. Most existing few-shot VQA methods simply extend few-shot classification methods to the cross-modal task, ignoring both the spatial distribution of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), in which a new concept, visual information entropy, enables multimodal feature distribution calibration through cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that characterizes the spatial distribution of visual features guided by the question; our proposed visual information entropy calibration module aligns it before and after the reasoning process to mitigate redundant information and improve the multimodal features. To further enhance the reasoning ability over cross-modal features, we additionally propose a novel pre-training method in which the reasoning sub-network of CDCIN is pretrained on the base classes in a VQA classification paradigm and then fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets.
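To make the central concept concrete, below is a minimal sketch of one plausible reading of "visual information entropy": the Shannon entropy of a question-guided spatial attention distribution over visual region features, with a calibration loss that aligns the entropy before and after reasoning. The abstract does not give the paper's exact formulation; the dot-product attention, function names, and squared-difference alignment here are all illustrative assumptions.

```python
import numpy as np


def visual_information_entropy(visual_feats, question_feat):
    """Entropy of a question-guided spatial distribution over visual features.

    visual_feats: (N, D) array of N spatial region features.
    question_feat: (D,) question embedding.
    NOTE: dot-product attention is an assumption, not the paper's exact form.
    """
    scores = visual_feats @ question_feat              # (N,) question-relevance scores
    scores = scores - scores.max()                     # numerical stability for softmax
    attn = np.exp(scores) / np.exp(scores).sum()       # spatial probability distribution
    return float(-(attn * np.log(attn + 1e-12)).sum())  # Shannon entropy


def entropy_calibration_loss(feats_before, feats_after, question_feat):
    """Align visual information entropy before and after reasoning
    (illustrative squared-difference penalty)."""
    h_before = visual_information_entropy(feats_before, question_feat)
    h_after = visual_information_entropy(feats_after, question_feat)
    return (h_before - h_after) ** 2
```

Under this reading, a large entropy gap signals that reasoning has diffused or collapsed the question-relevant spatial focus, and penalizing the gap discourages redundant information from accumulating in the multimodal features.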

Cite

Text

Zhang et al. "Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I7.28543

Markdown

[Zhang et al. "Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhang2024aaai-cross-a/) doi:10.1609/AAAI.V38I7.28543

BibTeX

@inproceedings{zhang2024aaai-cross-a,
  title     = {{Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering}},
  author    = {Zhang, Jing and Liu, Xiaoqiang and Chen, Mingzhe and Wang, Zhe},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {7151--7159},
  doi       = {10.1609/AAAI.V38I7.28543},
  url       = {https://mlanthology.org/aaai/2024/zhang2024aaai-cross-a/}
}