Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
Abstract
This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are always extremely insufficient, typically one answer per question. As a result, existing works tend to directly treat all the unlabeled answers as negative labels, leading to limited ability for generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach, which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings, and a listwise approach, which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
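The abstract does not state the loss functions, so the following is only an illustrative sketch of the two ideas it names, not the paper's formulation. It assumes that the soft margin for a pair is the teacher's score gap, and that the partial listwise loss is a Plackett-Luce negative log-likelihood over a randomly sampled subset of answers. The function names, the subset size `k`, and the toy scores are all hypothetical.

```python
import math
import random


def pairwise_soft_margin_loss(student, teacher):
    """Hinge loss over every pair the teacher ranks i above j.

    Assumption: the margin is the teacher's score gap, so pairs the
    teacher is unsure about impose a weaker constraint on the student.
    """
    total, pairs = 0.0, 0
    n = len(teacher)
    for i in range(n):
        for j in range(n):
            if teacher[i] > teacher[j]:
                margin = teacher[i] - teacher[j]
                total += max(0.0, margin - (student[i] - student[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def sampled_listwise_loss(student, teacher, k, rng):
    """Plackett-Luce negative log-likelihood on a sampled partial list.

    Assumption: k answers are drawn uniformly at random, ordered by the
    teacher's scores, and the student is asked to reproduce that order.
    """
    idx = rng.sample(range(len(teacher)), k)
    idx.sort(key=lambda i: teacher[i], reverse=True)
    loss = 0.0
    for t in range(k):
        tail = [student[i] for i in idx[t:]]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - student[idx[t]]
    return loss


# Toy teacher scores for five candidate answers (made-up values).
teacher_scores = [0.9, 0.7, 0.5, 0.3, 0.1]
student_scores = [0.2, 0.8, 0.4, 0.6, 0.1]
print(pairwise_soft_margin_loss(student_scores, teacher_scores))
print(sampled_listwise_loss(student_scores, teacher_scores, 3, random.Random(0)))
```

In this sketch, a student that reproduces the teacher's scores exactly incurs zero pairwise loss, and resampling the subset on each call means no single teacher ranking is imposed in full.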
Cite
Text
Liang et al. "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01250
Markdown
[Liang et al. "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/liang2024cvpr-ranking/) doi:10.1109/CVPR52733.2024.01250
BibTeX
@inproceedings{liang2024cvpr-ranking,
title = {{Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels}},
author = {Liang, Tianming and Tan, Chaolei and Xia, Beihao and Zheng, Wei-Shi and Hu, Jian-Fang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13161-13170},
doi = {10.1109/CVPR52733.2024.01250},
url = {https://mlanthology.org/cvpr/2024/liang2024cvpr-ranking/}
}