PixelLM: Pixel Reasoning with Large Multimodal Model
Abstract
While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM are a novel lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional, costly segmentation models. Furthermore, we propose a token fusion method to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods on multiple benchmarks, including MUSE and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
Cite
Text
Ren et al. "PixelLM: Pixel Reasoning with Large Multimodal Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02491
Markdown
[Ren et al. "PixelLM: Pixel Reasoning with Large Multimodal Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ren2024cvpr-pixellm/) doi:10.1109/CVPR52733.2024.02491
BibTeX
@inproceedings{ren2024cvpr-pixellm,
title = {{PixelLM: Pixel Reasoning with Large Multimodal Model}},
author = {Ren, Zhongwei and Huang, Zhicheng and Wei, Yunchao and Zhao, Yao and Fu, Dongmei and Feng, Jiashi and Jin, Xiaojie},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26374--26383},
doi = {10.1109/CVPR52733.2024.02491},
url = {https://mlanthology.org/cvpr/2024/ren2024cvpr-pixellm/}
}