Scalable Vision-Language Understanding and Generation

Abstract

Recent advances in vision-language models have shown remarkable potential, yet creating scalable systems that can effectively understand and generate across modalities remains challenging. This talk will present our contributions to advancing scalable vision-language systems, focusing on three key themes: (1) efficient vision-language understanding, including our work on temporal perceiving video-language pre-training and knowledge-enhanced zero-shot retrieval; (2) scalable generation frameworks, encompassing our innovations in zero-shot captioning and co-speech gesture generation; and (3) practical applications and deployments of these technologies. We will discuss how these advances have enabled both better performance and improved efficiency in real-world scenarios, and explore future directions for scalable multimodal systems.
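The understanding theme above centers on matching visual and textual inputs in a shared embedding space. As a loose illustration only, and not the specific method presented in this talk, the sketch below shows the generic zero-shot retrieval recipe: embed queries and candidates into a common space, then rank candidates by cosine similarity. The encode function is a hypothetical placeholder for a pre-trained encoder such as CLIP.

import numpy as np

def encode(items, dim=512, seed=0):
    # Hypothetical stand-in for a pre-trained vision/text encoder;
    # a real system would produce these embeddings with a model
    # such as CLIP rather than random vectors.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(items), dim))

def zero_shot_retrieve(query_emb, candidate_embs, top_k=3):
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the top-k most similar candidates, best first.
    return np.argsort(scores)[::-1][:top_k]

captions = ["a dog playing fetch", "a city skyline at night", "a bowl of ramen"]
text_embs = encode(captions, seed=1)
image_emb = encode(["query image"], seed=2)[0]
for rank, idx in enumerate(zero_shot_retrieve(image_emb, text_embs), start=1):
    print(f"{rank}. {captions[idx]}")

Because no task-specific training is involved, the same ranking procedure transfers to unseen label sets, which is what makes the retrieval "zero-shot".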

Cite

Text

Zhu. "Scalable Vision-Language Understanding and Generation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I27.35130

Markdown

[Zhu. "Scalable Vision-Language Understanding and Generation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhu2025aaai-scalable/) doi:10.1609/AAAI.V39I27.35130

BibTeX

@inproceedings{zhu2025aaai-scalable,
  title     = {{Scalable Vision-Language Understanding and Generation}},
  author    = {Zhu, Linchao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {28738--28739},
  doi       = {10.1609/AAAI.V39I27.35130},
  url       = {https://mlanthology.org/aaai/2025/zhu2025aaai-scalable/}
}