Just Ask Plus: Using Transcripts for VideoQA

Abstract

The Social-IQ 2.0 challenge is designed to benchmark the ability of recent AI technologies to reason about social interactions, a skill referred to as Artificial Social Intelligence, in the form of a VideoQA task. In this work, we use the Just Ask and SpeechT5 models as feature extractors and perform reasoning by adding one attention layer and two transformer encoders. Our best configuration reaches 53.35% accuracy on the validation set. The code is publicly available on GitHub.
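The abstract's reasoning head (one attention layer followed by two transformer encoders over pre-extracted features) could be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the answer-scoring linear layer, and the cross-attention arrangement (candidate features attending over transcript features) are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

class ReasoningHead(nn.Module):
    """Hypothetical sketch of the head described in the abstract: one
    attention layer fusing pre-extracted video/question features (e.g.
    from Just Ask) with transcript features (e.g. from SpeechT5),
    followed by two transformer encoder layers. Dimensions and the
    scoring layer are illustrative assumptions, not from the paper."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # One attention layer: candidate-answer features attend over transcripts.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Two transformer encoder layers for joint reasoning.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)  # per-candidate answer score

    def forward(self, qv_feats, transcript_feats):
        # qv_feats: (batch, n_answers, d_model), one feature per candidate answer
        # transcript_feats: (batch, seq_len, d_model)
        fused, _ = self.attn(qv_feats, transcript_feats, transcript_feats)
        fused = self.encoder(fused)
        return self.score(fused).squeeze(-1)  # (batch, n_answers) logits

model = ReasoningHead()
logits = model(torch.randn(2, 4, 512), torch.randn(2, 30, 512))
print(tuple(logits.shape))  # (2, 4)
```

At inference, the candidate answer with the highest logit would be selected as the model's prediction.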

Cite

Text

Pirhadi et al. "Just Ask Plus: Using Transcripts for VideoQA." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00332

Markdown

[Pirhadi et al. "Just Ask Plus: Using Transcripts for VideoQA." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/pirhadi2023iccvw-just/) doi:10.1109/ICCVW60793.2023.00332

BibTeX

@inproceedings{pirhadi2023iccvw-just,
  title     = {{Just Ask Plus: Using Transcripts for VideoQA}},
  author    = {Pirhadi, Mohammad Javad and Mirzaei, Motahhare and Eetemadi, Sauleh},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {3074--3077},
  doi       = {10.1109/ICCVW60793.2023.00332},
  url       = {https://mlanthology.org/iccvw/2023/pirhadi2023iccvw-just/}
}