L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks
Abstract
Large language models (LLMs) have recently shown notable progress in unifying various visual tasks in an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding ability in general domains, they lack human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to pre-defined task forms and lack open-ended capability. It is therefore necessary to build a large multi-modal model that leverages LLMs to unify various human-centric tasks. We forge ahead along this path from the aspects of both dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns based on 20 existing open datasets spanning 6 diverse downstream tasks, which provides sufficient and diverse data for multi-modal training. Then, a model named L-Man, including a query adapter, is designed to extract the multi-grained semantics of images and align cross-modal information between image and text. In practice, we introduce a two-stage training strategy, where the first stage extracts generic text-relevant visual information and the second stage maps the visual features to the embedding space of the LLM. Tuned on HumanIns, our model shows significant superiority on human-centric tasks compared with existing large multi-modal models, and even surpasses the respective task-specific models on downstream datasets.
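The query-adapter mechanism the abstract describes can be illustrated with a minimal sketch. All names, shapes, and the attention formulation below are assumptions for illustration, not the paper's actual implementation: a small set of learnable query tokens cross-attends to image patch features, and the attended output is linearly projected into the LLM's embedding space, roughly corresponding to the second training stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
num_queries, d_vis, d_llm = 8, 32, 64
image_feats = rng.normal(size=(49, d_vis))       # e.g. a 7x7 grid of patch features
queries = rng.normal(size=(num_queries, d_vis))  # learnable query tokens
W_proj = rng.normal(size=(d_vis, d_llm))         # projection into the LLM space

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query token attends over all image patch features.
attn = softmax(queries @ image_feats.T / np.sqrt(d_vis))  # (8, 49)
attended = attn @ image_feats                              # (8, 32)

# Map the attended visual tokens into the LLM embedding space,
# where they can be concatenated with text token embeddings.
llm_tokens = attended @ W_proj                             # (8, 64)
print(llm_tokens.shape)  # (8, 64)
```

In practice such an adapter would be trained (e.g. with the two-stage strategy the abstract outlines) rather than randomly initialized and frozen; the sketch only shows the data flow from image features to LLM-space tokens.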
Cite
Text
Zuo et al. "L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33206
Markdown
[Zuo et al. "L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zuo2025aaai-l/) doi:10.1609/AAAI.V39I10.33206
BibTeX
@inproceedings{zuo2025aaai-l,
title = {{L-Man: A Large Multi-Modal Model Unifying Human-Centric Tasks}},
author = {Zuo, Jialong and Nie, Ying and Guo, Tianyu and Zhang, Huaxin and Hong, Jiahao and Sang, Nong and Gao, Changxin and Han, Kai},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {11095-11103},
doi = {10.1609/AAAI.V39I10.33206},
url = {https://mlanthology.org/aaai/2025/zuo2025aaai-l/}
}