L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks
Abstract
Large language models (LLMs) have recently shown notable progress in unifying various visual tasks in an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding ability in general domains, they lack human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to pre-defined task forms and lack open-ended capability. It is therefore necessary to build a large multi-modal model that leverages LLMs to unify various human-centric tasks. We forge ahead along this path from the aspects of both dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns based on 20 existing open datasets spanning 6 diverse downstream tasks, which provides sufficient and diverse data for multi-modal training. Then, a model named L-Man, including a query adapter, is designed to extract the multi-grained semantics of images and align cross-modal information between image and text. In practice, we introduce a two-stage training strategy, where the first stage extracts generic text-relevant visual information and the second stage maps the visual features to the embedding space of the LLM. Tuned on HumanIns, our model shows significant superiority on human-centric tasks compared with existing large multi-modal models, and even surpasses the respective task-specific models on downstream datasets.
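The query-adapter mechanism the abstract describes can be illustrated with a minimal sketch. All names, shapes, and the attention formulation below are assumptions for illustration, not the paper's actual implementation: a small set of learnable query tokens cross-attends to image patch features, and the attended output is linearly projected into the LLM's embedding space, roughly corresponding to the second training stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
num_queries, d_vis, d_llm = 8, 32, 64
image_feats = rng.normal(size=(49, d_vis))       # e.g. a 7x7 grid of patch features
queries = rng.normal(size=(num_queries, d_vis))  # learnable query tokens
W_proj = rng.normal(size=(d_vis, d_llm))         # projection into the LLM space

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query token attends over all image patch features.
attn = softmax(queries @ image_feats.T / np.sqrt(d_vis))  # (8, 49)
attended = attn @ image_feats                              # (8, 32)

# Map the attended visual tokens into the LLM embedding space,
# where they can be concatenated with text token embeddings.
llm_tokens = attended @ W_proj                             # (8, 64)
print(llm_tokens.shape)  # (8, 64)
```

In practice such an adapter would be trained (e.g. with the two-stage strategy the abstract outlines) rather than randomly initialized and frozen; the sketch only shows the data flow from image features to LLM-space tokens.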
Cite
Text
Zuo et al. "L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33206
Markdown
[Zuo et al. "L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zuo2025aaai-l/) doi:10.1609/AAAI.V39I10.33206
BibTeX
@inproceedings{zuo2025aaai-l,
title = {{L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks}},
author = {Zuo, Jialong and Nie, Ying and Guo, Tianyu and Zhang, Huaxin and Hong, Jiahao and Sang, Nong and Gao, Changxin and Han, Kai},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {11095-11103},
doi = {10.1609/AAAI.V39I10.33206},
url = {https://mlanthology.org/aaai/2025/zuo2025aaai-l/}
}