Achieving Ensemble-like Performance in a Single Model: A Feature Diversification Framework for Image-Text Matching

Abstract

Model ensembling is a widely used technique that enhances performance in image-text matching tasks by combining multiple models, each trained with a different initialization. However, the cost of training several models and generating outputs from each of them limits the practical applicability of ensembling. In this paper, we argue that while the parameters of two randomly initialized models can differ significantly, their feature distributions can be similar at certain stages. Using a proposed technique called cross-modal realignment, we demonstrate that features derived from differently initialized models remain similar at the feature extraction stage and can be effectively transformed into one another by fine-tuning a small number of parameters. These findings provide an efficient way to achieve ensemble-like performance within a single model. Specifically, we propose a Feature Diversification Framework (FDF) that emulates the outputs of multiple model initializations by generating diverse features from a common shared feature. First, we introduce feature conversion methods that transform the shared feature into a set of distinct features. Next, we present a realignment training strategy that optimizes negative pairs to realign these transformed features, thereby enhancing their diversity so that they resemble the outputs of different models. Finally, we propose a reweighting module that assigns weights to these features, enabling a weighted fusion for robust feature representation. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate the effectiveness and generalizability of our framework.
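The data flow described in the abstract (one shared feature, several converted features, then a weighted fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the residual linear converters, the single gating vector, and the names `diversify`, `reweight_and_fuse`, and `gate` are all assumptions made for illustration, the parameters are random rather than learned, and the realignment training strategy is omitted entirely.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diversify(shared, converters):
    """Turn one shared feature per sample into K distinct features.

    shared:     (N, D) shared features
    converters: (K, D, D) small residual maps (hypothetical stand-ins
                for the paper's feature conversion methods)
    returns:    (N, K, D) unit-length converted features
    """
    delta = np.einsum("nd,kde->nke", shared, converters)
    return l2_normalize(shared[:, None, :] + delta)

def reweight_and_fuse(features, gate):
    """Assign a weight to each of the K features and fuse them.

    features: (N, K, D); gate: (D,) scoring vector (an assumption)
    returns:  fused (N, D) unit-length features and weights (N, K)
    """
    weights = softmax(features @ gate, axis=1)
    fused = (weights[..., None] * features).sum(axis=1)
    return l2_normalize(fused), weights

# Toy data: random stand-ins for encoder outputs and parameters.
rng = np.random.default_rng(0)
N, K, D = 4, 3, 8
image_shared = l2_normalize(rng.normal(size=(N, D)))
text_shared = l2_normalize(rng.normal(size=(N, D)))
img_conv = 0.1 * rng.normal(size=(K, D, D))
txt_conv = 0.1 * rng.normal(size=(K, D, D))
gate = rng.normal(size=D)

img_fused, img_w = reweight_and_fuse(diversify(image_shared, img_conv), gate)
txt_fused, txt_w = reweight_and_fuse(diversify(text_shared, txt_conv), gate)

# Cosine similarity between every image and every text, shape (N, N).
similarity = img_fused @ txt_fused.T
```

In this sketch a single forward pass through one shared encoder yields K feature variants, which is where the efficiency argument over a K-model ensemble comes from; how the converters and weights are actually trained is specified in the paper, not here.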

Cite

Text

Zhou et al. "Achieving Ensemble-like Performance in a Single Model: A Feature Diversification Framework for Image-Text Matching." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33182

Markdown

[Zhou et al. "Achieving Ensemble-like Performance in a Single Model: A Feature Diversification Framework for Image-Text Matching." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhou2025aaai-achieving/) doi:10.1609/AAAI.V39I10.33182

BibTeX

@inproceedings{zhou2025aaai-achieving,
  title     = {{Achieving Ensemble-like Performance in a Single Model: A Feature Diversification Framework for Image-Text Matching}},
  author    = {Zhou, Zhao and Wang, Yiqun and Zhang, Weizhong and Zheng, Yingbin and Du, Xiangcheng and Jin, Cheng},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10879--10886},
  doi       = {10.1609/AAAI.V39I10.33182},
  url       = {https://mlanthology.org/aaai/2025/zhou2025aaai-achieving/}
}