LLMs Can See and Hear Without Any Training

Abstract

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbuing multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually producing a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state of the art on emergent zero-shot image, video, and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
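
The abstract describes a generate-score-feedback loop. The sketch below is a minimal illustration of that idea, not the paper's actual implementation: the generate and score callables, the prompt format, and all parameter names are assumptions made for illustration (e.g., score could be a CLIP image-text similarity for captioning).

# Minimal sketch of an iterative generate-score-feedback loop, assuming
# hypothetical `generate` (LLM call) and `score` (e.g., CLIP similarity) callables.
def mils_loop(generate, score, task_prompt, num_candidates=32, num_steps=10, top_k=5):
    """Iteratively refine text candidates with an LLM generator and a scorer.

    generate(prompt, n) -> list[str]   # hypothetical LLM sampling call
    score(candidate)    -> float       # hypothetical task-specific scorer
    """
    feedback = ""          # scored examples carried over from previous steps
    best = None            # (score, candidate) of the best solution so far
    for _ in range(num_steps):
        prompt = task_prompt + "\n" + feedback
        candidates = generate(prompt, num_candidates)
        # Score every candidate and sort from best to worst.
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0]
        # Feed the top-scoring candidates back as in-context guidance.
        feedback = "\n".join(f"{s:.3f}: {c}" for s, c in scored[:top_k])
    return best[1]

Because the loop only needs candidate scores, no gradients flow through the scorer, which is what makes the approach training-free and lets it optimize through non-differentiable pipelines such as text-to-image generators.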

Cite

Text

Ashutosh et al. "LLMs Can See and Hear Without Any Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Ashutosh et al. "LLMs Can See and Hear Without Any Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ashutosh2025icml-llms/)

BibTeX

@inproceedings{ashutosh2025icml-llms,
  title     = {{LLMs Can See and Hear Without Any Training}},
  author    = {Ashutosh, Kumar and Gandelsman, Yossi and Chen, Xinlei and Misra, Ishan and Girdhar, Rohit},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {1762--1776},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/ashutosh2025icml-llms/}
}