GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering
Abstract
Deep learning models with large-scale backbones have been increasingly adopted to tackle complex visual question answering (VQA) problems in real-world settings. While providing powerful learning capacities to handle the high-dimensional and multimodal VQA data, these models tend to suffer from the memorization effect, leading to overconfident predictions. This can significantly limit their applicability in critical domains (e.g., medicine, cyber-security, and public safety), where confidently wrong predictions may lead to severe consequences. In this work, we propose a novel low-rank network factorization that results in much better-calibrated networks. These low-rank factorized networks are then aggregated into an ensemble guided by a generalized focal loss to further improve the overall performance and calibration. The overall framework, referred to as the Generalized focal Loss Ensemble of low-rank Networks (GLEN), is an important step toward developing well-calibrated VQA models. We theoretically demonstrate that the generalized focal loss provides a more balanced bias-variance trade-off, which is guaranteed to lower the confidence of incorrect predictions, resulting in improved calibration. Extensive experiments on benchmark datasets and comparisons across various VQA models show that GLEN leads to much better calibration on both in-distribution and out-of-distribution data without sacrificing VQA accuracy.
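To make the notions of "focal loss" and "calibration" in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration only and rests on two assumptions. First, `focal_loss` is the standard focal loss with a single focusing parameter `gamma`, not the generalized focal loss proposed in the paper, whose definition the abstract does not give. Second, `expected_calibration_error` uses the common equal-width binning estimate of calibration error, and the abstract does not state which calibration metric the authors report. Both function names are hypothetical.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one example (assumed form, not the paper's
    generalized variant). p_true is the predicted probability of the correct
    answer; gamma = 0 reduces to ordinary cross-entropy."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin estimate of calibration error: the sample-weighted
    average of |accuracy - mean confidence| over the non-empty bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident toy model: 95% confidence, but only half the answers right.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
```

In this toy case the gap between confidence (0.95) and accuracy (0.5) is the kind of overconfidence the abstract describes. The focal term `(1 - p_true) ** gamma` down-weights examples the model already predicts confidently, which is the usual intuition for why focal-style losses tend to reduce overconfidence.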
Cite
Text
Mozaffari et al. "GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I18.34154
Markdown
[Mozaffari et al. "GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/mozaffari2025aaai-glen/) doi:10.1609/AAAI.V39I18.34154
BibTeX
@inproceedings{mozaffari2025aaai-glen,
title = {{GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering}},
author = {Mozaffari, Mahsa and Sapkota, Hitesh and Yu, Qi},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {19563-19571},
doi = {10.1609/AAAI.V39I18.34154},
url = {https://mlanthology.org/aaai/2025/mozaffari2025aaai-glen/}
}