Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning
Abstract
Video captioning automatically generates natural language descriptions of the content of video frames. When deploying captioning models in specialized domains, active learning can help reduce the high annotation cost. However, the generative nature of the captioning process is more complex than standard supervised learning tasks and introduces several challenges for active learning in video captioning. Entropy-based uncertainty estimation, which is widely used in active learning, may be inflated in captioning tasks and mislead active sampling. Another challenge arises from the rich content of videos, as each video could be described in multiple ways. A single uncertainty score obtained from one possible caption does not capture the diversity induced by the rich content. To fill this gap, we propose identifying multiple sources of uncertainty and performing hierarchical aggregation to integrate uncertainty from distinct sources. This yields a holistic uncertainty metric that quantifies the overall informativeness of video content for active sampling. The overall uncertainty is built upon conditional vacuity, an extension to the captioning setting of the second-order uncertainty introduced with the evidential learning framework, leading to more robust uncertainty estimation without inflation. Both theoretical analysis and experimental evaluation are conducted to demonstrate the effectiveness of the proposed framework for complex uncertainty estimation and interactive learning.
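The contrast the abstract draws between entropy and vacuity can be illustrated with a minimal sketch. It uses the standard vacuity from evidential learning, u = K / S, where K is the number of classes and S is the sum of the Dirichlet parameters alpha_k = evidence_k + 1. It is not the paper's conditional vacuity or its hierarchical aggregation, whose definitions are not given in the abstract. The function names and the four-token toy vocabulary are assumptions made for illustration.

```python
import math


def vacuity(evidence):
    """Standard evidential vacuity u = K / S, with alpha_k = evidence_k + 1.

    Illustrative only: this is the generic second-order uncertainty from
    evidential learning, not the paper's conditional vacuity.
    """
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # S = sum of Dirichlet alphas
    return num_classes / strength


def expected_entropy(evidence):
    """Shannon entropy (in nats) of the expected probabilities alpha_k / S."""
    alphas = [e + 1.0 for e in evidence]
    strength = sum(alphas)
    return -sum((a / strength) * math.log(a / strength) for a in alphas)


# Hypothetical toy vocabulary of four tokens.
no_evidence = [0.0, 0.0, 0.0, 0.0]             # the model has seen nothing relevant
balanced_evidence = [100.0, 100.0, 100.0, 100.0]  # strong support for several valid tokens

for name, ev in [("no evidence", no_evidence), ("balanced evidence", balanced_evidence)]:
    print(f"{name}: entropy={expected_entropy(ev):.4f}, vacuity={vacuity(ev):.4f}")
```

Both cases give the same maximal entropy, because the expected token distribution is uniform in each, but only the first has high vacuity. Under these assumptions, a token position with several well-supported alternatives is not mistaken for one the model knows nothing about, which is the kind of inflation the abstract attributes to entropy-based scores.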
Cite
Text
Zheng and Yu. "Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I13.33590
Markdown
[Zheng and Yu. "Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zheng2025aaai-hierarchical/) doi:10.1609/AAAI.V39I13.33590
BibTeX
@inproceedings{zheng2025aaai-hierarchical,
title = {{Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning}},
author = {Zheng, Ervine and Yu, Qi},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {14512-14519},
doi = {10.1609/AAAI.V39I13.33590},
url = {https://mlanthology.org/aaai/2025/zheng2025aaai-hierarchical/}
}