Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models

Abstract

The astonishing abilities of pretrained foundation models, e.g., large language models (LLMs), vision-language models, and diffusion models, have revolutionized today's AI research and development. In this talk, I will answer two questions: Q1: How can we efficiently train or fine-tune foundation models? Q2: How can we build strong open-world multimodal understanding and generation models with these pretrained foundation models?

Cite

Text

Chen. "Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I27.35101

Markdown

[Chen. "Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/chen2025aaai-open/) doi:10.1609/AAAI.V39I27.35101

BibTeX

@inproceedings{chen2025aaai-open,
  title     = {{Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models}},
  author    = {Chen, Long},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {28706},
  doi       = {10.1609/AAAI.V39I27.35101},
  url       = {https://mlanthology.org/aaai/2025/chen2025aaai-open/}
}