Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering
Abstract
Few-shot Visual Question Answering (VQA) is an emerging and challenging cross-modal learning task in computer vision. Most existing few-shot VQA methods simply extend few-shot classification methods to the cross-modal setting, ignoring both the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), in which a new concept, visual information entropy, is introduced to calibrate multimodal feature distributions through cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that characterizes the spatial distribution of visual features under the guidance of the question; our proposed visual information entropy calibration module aligns it before and after the reasoning process to mitigate redundant information and refine the multimodal features. To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method in which the reasoning sub-network of CDCIN is pretrained on the base classes in a VQA classification paradigm and then fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that our proposed CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets.
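As an illustration of the core idea, the sketch below computes a question-guided spatial attention distribution over visual features and its Shannon entropy, then aligns the entropy before and after reasoning with a simple L1 penalty. This is a minimal reconstruction assuming dot-product attention and an L1 alignment objective; the function names and the exact calibration loss are hypothetical, not the authors' implementation.

```python
import numpy as np

def visual_information_entropy(visual_feats, question_vec):
    """Shannon entropy of a question-guided spatial attention distribution.

    visual_feats: (N, D) array of N spatial visual features.
    question_vec: (D,) question embedding.
    Illustrative assumption: attention is a softmax over dot-product scores.
    """
    logits = visual_feats @ question_vec          # (N,) question-guided scores
    logits = logits - logits.max()                # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()  # softmax -> spatial distribution
    # Shannon entropy of the spatial distribution (natural log)
    return float(-(attn * np.log(attn + 1e-12)).sum())

def entropy_calibration_loss(feats_before, feats_after, question_vec):
    """Align entropy before and after the reasoning step (L1 distance),
    a hypothetical stand-in for the paper's calibration objective."""
    h_before = visual_information_entropy(feats_before, question_vec)
    h_after = visual_information_entropy(feats_after, question_vec)
    return abs(h_before - h_after)
```

Low entropy here means the question concentrates attention on a few spatial locations; penalizing entropy drift across the reasoning step is one plausible way to keep the calibrated features from collapsing onto redundant regions.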
Cite
Text
Zhang et al. "Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I7.28543
Markdown
[Zhang et al. "Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhang2024aaai-cross-a/) doi:10.1609/AAAI.V38I7.28543
BibTeX
@inproceedings{zhang2024aaai-cross-a,
title = {{Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering}},
author = {Zhang, Jing and Liu, Xiaoqiang and Chen, Mingzhe and Wang, Zhe},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2024},
pages = {7151-7159},
doi = {10.1609/AAAI.V38I7.28543},
url = {https://mlanthology.org/aaai/2024/zhang2024aaai-cross-a/}
}