Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Abstract
Multi-modal large language models (MLLMs) are trained on top of large language models (LLMs), with an enhanced capability to comprehend multi-modal inputs and generate textual responses. While they excel in multi-modal tasks, the conventional view within the machine learning community has often undervalued or overlooked their capabilities in pure natural language processing. This paper steps outside that view and showcases an intriguing characteristic of multi-modally trained LLMs: our preliminary results suggest that visual instruction tuning, a prevailing strategy for integrating vision knowledge into LLMs, unexpectedly helps models attain both improved truthfulness and ethical alignment in the pure NLP context. For example, a visual-instruction-tuned LLaMA2 7B model surpasses the LLaMA2-chat 7B model, which was fine-tuned with over one million human annotations, on the TruthfulQA and Ethics benchmarks. Similarly, the latest LLaMA3 series shows consistent performance gains of 0.6% on average after visual instruction tuning. As another example, two versions of the proprietary GPT-4V-turbo model, which incorporates visual information, surpass their LLM-only counterpart GPT-4-turbo by around 1.6% on both aspects. Further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent in visual-text data. By presenting these findings, we advocate for a broader exploration of visual-text synergies, positing that such multi-modal interactions could be pivotal in advancing alignment research. By releasing our code at https://github.com/UCSC-VLAA/Sight-Beyond-Text, we aspire to foster further exploration of the intrinsic value of visual-text synergies and, more broadly, multi-modal interactions in alignment research.
Cite
Text
Tu et al. "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics." Transactions on Machine Learning Research, 2024.Markdown
[Tu et al. "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/tu2024tmlr-sight/)BibTeX
@article{tu2024tmlr-sight,
  title   = {{Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics}},
  author  = {Tu, Haoqin and Zhao, Bingchen and Wei, Chen and Xie, Cihang},
  journal = {Transactions on Machine Learning Research},
  year    = {2024},
  url     = {https://mlanthology.org/tmlr/2024/tu2024tmlr-sight/}
}