Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

Abstract

This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are always extremely insufficient, typically one answer per question. As a result, existing works tend to directly treat all the unlabeled answers as negative labels, leading to limited ability for generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach, which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings, and a listwise approach, which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
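
To make the pairwise idea concrete, the following is a minimal sketch of ranking distillation with a margin that adapts to the teacher's confidence. It is not the formulation from the paper: the abstract gives no equations, so the choice of margin (the gap between the teacher's softmax probabilities), the hinge form of the loss, and the function names `softmax` and `pairwise_soft_margin_loss` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Convert raw answer scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_soft_margin_loss(student_scores, teacher_scores):
    """Hinge loss over every answer pair that the teacher orders.

    ASSUMPTION (not from the paper): the margin for a pair (i, j) is the
    teacher's probability gap t[i] - t[j], so pairs the teacher is unsure
    about impose only a weak constraint on the student.
    """
    t = softmax(teacher_scores)
    s = softmax(student_scores)
    loss, pairs = 0.0, 0
    for i in range(len(t)):
        for j in range(len(t)):
            if t[i] > t[j]:  # teacher ranks answer i above answer j
                margin = t[i] - t[j]
                # Penalise the student only when its gap falls short of the margin.
                loss += max(0.0, margin - (s[i] - s[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0


teacher = [2.0, 1.0, 0.0]
agree = pairwise_soft_margin_loss([2.0, 1.0, 0.0], teacher)
disagree = pairwise_soft_margin_loss([0.0, 1.0, 2.0], teacher)
```

In this sketch a student that reproduces the teacher's scores incurs zero loss, while a student that reverses the ranking is penalised. Because the margin shrinks for pairs the teacher scores almost equally, an unreliable teacher preference constrains the student less than a confident one.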

Cite

Text

Liang et al. "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01250

Markdown

[Liang et al. "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/liang2024cvpr-ranking/) doi:10.1109/CVPR52733.2024.01250

BibTeX

@inproceedings{liang2024cvpr-ranking,
  title     = {{Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels}},
  author    = {Liang, Tianming and Tan, Chaolei and Xia, Beihao and Zheng, Wei-Shi and Hu, Jian-Fang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13161-13170},
  doi       = {10.1109/CVPR52733.2024.01250},
  url       = {https://mlanthology.org/cvpr/2024/liang2024cvpr-ranking/}
}