SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Abstract

When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally combining them effectively. We propose a novel method, applied to the VizWiz VQA task, that predicts the visual skills needed to answer a question, leverages expert modules to produce intermediary outputs, and fuses those outputs in a skill-aware manner. Unlike prior works in visual question answering (VQA) that use intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model on what to focus on via a skill embedding. While our results show that skill-aware fusion outperforms skill-unaware models only for a subset of questions, we believe they suggest interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes.
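
The abstract only sketches the architecture, so the following is a minimal, hypothetical illustration of what skill-aware fusion could look like, assuming a PyTorch implementation in which a predicted skill distribution gates the features produced by expert modules (e.g., OCR and object detection) before answer prediction. The module and layer names (SkillAwareFusion, skill_head, answer_head), the sigmoid gating, and the answer-vocabulary size are our own assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn

class SkillAwareFusion(nn.Module):
    """Illustrative sketch: weight expert-module features by predicted skill relevance."""

    def __init__(self, num_skills=3, feat_dim=512, num_answers=1000):
        super().__init__()
        # One learned projection per expert module (e.g., OCR, object detection, captioning).
        self.expert_proj = nn.ModuleList([nn.Linear(feat_dim, feat_dim) for _ in range(num_skills)])
        # Predicts how relevant each skill is for the given (image, question) pair.
        self.skill_head = nn.Linear(feat_dim, num_skills)
        # Maps the fused representation to an answer vocabulary (size assumed here).
        self.answer_head = nn.Linear(feat_dim, num_answers)

    def forward(self, joint_feat, expert_feats):
        # joint_feat: (B, feat_dim) joint image-question embedding (e.g., from CLIP).
        # expert_feats: list of num_skills tensors, each of shape (B, feat_dim).
        skill_weights = torch.sigmoid(self.skill_head(joint_feat))  # (B, num_skills)
        fused = joint_feat
        for i, (proj, feat) in enumerate(zip(self.expert_proj, expert_feats)):
            # Scale each expert's contribution by its predicted skill relevance.
            fused = fused + skill_weights[:, i:i + 1] * proj(feat)
        return self.answer_head(fused), skill_weights

In this sketch the skill weights act as soft gates over the expert features; a hard top-k selection of experts would be another plausible reading of "skill-aware fusion."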

Cite

Text

Naik et al. "SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I21.30486

Markdown

[Naik et al. "SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/naik2024aaai-skillclip/) doi:10.1609/AAAI.V38I21.30486

BibTeX

@inproceedings{naik2024aaai-skillclip,
  title     = {{SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)}},
  author    = {Naik, Atharva and Butala, Yash Parag and Vaikunthan, Navaneethan and Kapoor, Raghav},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {23592--23593},
  doi       = {10.1609/AAAI.V38I21.30486},
  url       = {https://mlanthology.org/aaai/2024/naik2024aaai-skillclip/}
}