Multi-Modal Video Dialog State Tracking in the Wild
Abstract
We present MST-MIXER, a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major aspects: (1) they either track only one modality (mostly the visual input), or (2) they target synthetic datasets that do not reflect real-world in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, it predicts the missing underlying structure of each input modality by learning local latent graphs with a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph, and the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
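The pipeline the abstract describes (infer a sparse latent graph per modality, aggregate features over it, then fuse the resulting node features into the VLM hidden states) can be sketched as below. This is a minimal illustrative sketch, not the paper's actual method: the function names, the top-k similarity sparsification, the mean-aggregation message passing, and the additive fusion with weight `alpha` are all assumptions made for clarity.

```python
import numpy as np

def learn_local_graph(feats, k=2):
    """Sketch of graph structure learning: infer a sparse adjacency
    from pairwise feature similarity, keeping the top-k neighbors
    per node (an illustrative choice, not the paper's mechanism)."""
    sim = feats @ feats.T                     # (N, N) similarity matrix
    np.fill_diagonal(sim, -np.inf)            # exclude self-loops
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[-k:]        # k strongest edges for node i
        adj[i, nbrs] = 1.0
    return adj

def gnn_step(feats, adj):
    """One mean-aggregation message-passing step over the learned graph."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj @ feats / np.maximum(deg, 1.0)

def enhance_hidden_states(hidden, node_feats, alpha=0.1):
    """Additively fuse graph node features into the backbone's hidden
    states (alpha is a hypothetical fusion weight)."""
    return hidden + alpha * node_feats

# Toy usage: 4 modality constituents with 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
adj = learn_local_graph(feats, k=2)
node_feats = gnn_step(feats, adj)
enhanced = enhance_hidden_states(feats, node_feats)
```

In the full model, one such local graph would be learned per modality and the local graphs mixed into a global graph before fusion; the sketch shows a single modality for brevity.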
Cite
Text

Abdessaied et al. "Multi-Modal Video Dialog State Tracking in the Wild." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72998-0_20

Markdown

[Abdessaied et al. "Multi-Modal Video Dialog State Tracking in the Wild." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/abdessaied2024eccv-multimodal/) doi:10.1007/978-3-031-72998-0_20

BibTeX
@inproceedings{abdessaied2024eccv-multimodal,
title = {{Multi-Modal Video Dialog State Tracking in the Wild}},
author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72998-0_20},
url = {https://mlanthology.org/eccv/2024/abdessaied2024eccv-multimodal/}
}