What Do MLLMs Hear? Examining the Interaction Between LLM and Audio Encoder Components in Multimodal Large Language Models
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images, known as multimodal LLMs (MLLMs), which are capable of generating descriptions of images or sound recordings. We evaluate whether MLLMs' separate representations of auditory and textual information sever the reasoning pathway between the audio encoder and the LLM component. Through a captioning-based classification experiment with similar and hierarchical textual relationships, we demonstrate that audio MLLMs cannot fully leverage their LLMs' text-based reasoning when generating audio captions.
Cite
Text
Çoban et al. "What Do MLLMs Hear? Examining the Interaction Between LLM and Audio Encoder Components in Multimodal Large Language Models." NeurIPS 2024 Workshops: Audio_Imagination, 2024.Markdown
[Çoban et al. "What Do MLLMs Hear? Examining the Interaction Between LLM and Audio Encoder Components in Multimodal Large Language Models." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/coban2024neuripsw-mllms/)BibTeX
@inproceedings{coban2024neuripsw-mllms,
  title = {{What Do MLLMs Hear? Examining the Interaction Between LLM and Audio Encoder Components in Multimodal Large Language Models}},
  author = {Çoban, Enis Berk and Mandel, Michael I and Devaney, Johanna},
  booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/coban2024neuripsw-mllms/}
}