PixelLM: Pixel Reasoning with Large Multimodal Model

Abstract

While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM are a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional, costly segmentation models. Furthermore, we propose a token fusion method to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods on multiple benchmarks, including MUSE and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
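The abstract describes a decoder that produces one mask per target from the hidden embeddings of segmentation codebook tokens. A common way to realize this kind of token-to-mask mapping (a minimal sketch under assumed shapes and NumPy in place of a deep-learning framework, not the authors' actual decoder) is to take the dot product of each token embedding with per-pixel image features and apply a sigmoid:

```python
import numpy as np

def predict_masks(token_embeds: np.ndarray, pixel_feats: np.ndarray) -> np.ndarray:
    """Sketch of mask prediction from token embeddings (hypothetical shapes).

    token_embeds: (num_targets, d) -- one embedding per segmentation token.
    pixel_feats:  (d, H, W)        -- per-pixel image features from the vision encoder.

    Each target's mask logit map is the dot product of its token embedding
    with every pixel feature; a sigmoid turns logits into per-pixel probabilities.
    """
    d, h, w = pixel_feats.shape
    logits = token_embeds @ pixel_feats.reshape(d, h * w)  # (num_targets, H*W)
    probs = 1.0 / (1.0 + np.exp(-logits))                  # sigmoid
    return probs.reshape(-1, h, w)                         # (num_targets, H, W)

# Toy usage: 3 targets, 16-dim embeddings, an 8x8 feature map.
rng = np.random.default_rng(0)
masks = predict_masks(rng.normal(size=(3, 16)), rng.normal(size=(16, 8, 8)))
```

This illustrates only the final token-to-mask step; the paper's full design additionally uses multi-scale features and the token fusion mechanism described above.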

Cite

Text

Ren et al. "PixelLM: Pixel Reasoning with Large Multimodal Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02491

Markdown

[Ren et al. "PixelLM: Pixel Reasoning with Large Multimodal Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ren2024cvpr-pixellm/) doi:10.1109/CVPR52733.2024.02491

BibTeX

@inproceedings{ren2024cvpr-pixellm,
  title     = {{PixelLM: Pixel Reasoning with Large Multimodal Model}},
  author    = {Ren, Zhongwei and Huang, Zhicheng and Wei, Yunchao and Zhao, Yao and Fu, Dongmei and Feng, Jiashi and Jin, Xiaojie},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {26374--26383},
  doi       = {10.1109/CVPR52733.2024.02491},
  url       = {https://mlanthology.org/cvpr/2024/ren2024cvpr-pixellm/}
}