Cross-Domain Multi-Modal Few-Shot Object Detection via Rich Text

Zeyu Shangguan, Daniel Seita, Mohammad Rostami

WACV 2025 pp. 6570-6580

Abstract

Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. However, existing multi-modal object detection (MM-OD) methods degrade when facing significant domain shift and insufficient samples. We hypothesize that rich text information can help the model more effectively build a knowledge relationship between a vision instance and its language description, and can help mitigate domain shift. Specifically, we study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning-based multi-modal few-shot object detection method that uses rich text semantic information as an auxiliary modality to achieve domain adaptation. Our proposed network contains a multi-modal feature aggregation module that aligns the vision and language support feature embeddings, and a rich text semantic rectify module that uses bidirectional text feature generation to reinforce multi-modal feature alignment and thus enhance the model's language understanding capability. We evaluate our model on standard cross-domain object detection datasets and demonstrate that our approach considerably outperforms existing FSOD methods. Our implementation is publicly available: https://github.com/zshanggu/CDMM

Cite

Text

Shangguan et al. "Cross-Domain Multi-Modal Few-Shot Object Detection via Rich Text." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Shangguan et al. "Cross-Domain Multi-Modal Few-Shot Object Detection via Rich Text." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/shangguan2025wacv-crossdomain/)

BibTeX

@inproceedings{shangguan2025wacv-crossdomain,
  title     = {{Cross-Domain Multi-Modal Few-Shot Object Detection via Rich Text}},
  author    = {Shangguan, Zeyu and Seita, Daniel and Rostami, Mohammad},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {6570--6580},
  url       = {https://mlanthology.org/wacv/2025/shangguan2025wacv-crossdomain/}
}