Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution Using Parameter Augmentation

Abstract

The context of modern smart voice assistants is often multi-modal: users consume images, audio, and video content simultaneously. In such a setup, co-reference resolution is especially challenging, as it runs across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human for a shopping use case. We propose and evaluate a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a $4.9\%$ absolute F1 improvement over a baseline while reducing the number of trained parameters by $13.3\%$ on a complex co-reference resolution task over a multi-turn shopping dataset.
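The abstract does not spell out the parameter augmentation mechanism, but one common way to align image embeddings with a uni-modal language model while training only a small set of added parameters is a trainable projection ("adapter") that maps image features into the frozen LM's token-embedding space. The sketch below is purely illustrative of that general idea, assuming PyTorch and made-up dimensions and names (`ImageToTokenAdapter`, `num_prefix_tokens`); it is not the authors' architecture.

```python
# Hypothetical sketch: project image features into a frozen LM's embedding space
# so that only the newly added ("augmented") parameters are trained.
# All names and dimensions are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn


class ImageToTokenAdapter(nn.Module):
    """Maps an image feature vector to a short sequence of pseudo-token embeddings."""

    def __init__(self, image_dim: int, lm_dim: int, num_prefix_tokens: int = 4):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.proj = nn.Linear(image_dim, lm_dim * num_prefix_tokens)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, image_dim) -> (batch, num_prefix_tokens, lm_dim)
        batch = image_feats.shape[0]
        return self.proj(image_feats).view(batch, self.num_prefix_tokens, -1)


# Toy usage with random tensors standing in for image features and the
# (frozen) language model's token embeddings.
adapter = ImageToTokenAdapter(image_dim=512, lm_dim=768)
image_feats = torch.randn(2, 512)                    # e.g. features from a vision encoder
text_embeds = torch.randn(2, 10, 768)                # token embeddings from the frozen LM
prefix = adapter(image_feats)                        # (2, 4, 768), trainable parameters only here
lm_inputs = torch.cat([prefix, text_embeds], dim=1)  # image-conditioned input sequence
print(lm_inputs.shape)                               # torch.Size([2, 14, 768])
```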

Cite

Text

Osebe et al. "Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution Using Parameter Augmentation." ICLR 2024 Workshops: LLMAgents, 2024.

Markdown

[Osebe et al. "Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution Using Parameter Augmentation." ICLR 2024 Workshops: LLMAgents, 2024.](https://mlanthology.org/iclrw/2024/osebe2024iclrw-adapting/)

BibTeX

@inproceedings{osebe2024iclrw-adapting,
  title     = {{Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution Using Parameter Augmentation}},
  author    = {Osebe, Samuel and Wanigasekara, Prashan and Tran, Thanh and Gueudre, Thomas},
  booktitle = {ICLR 2024 Workshops: LLMAgents},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/osebe2024iclrw-adapting/}
}