Subobject-Level Image Tokenization

Abstract

Patch-based image tokenization ignores the morphology of the visual world, which limits effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixels, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. EPOC combines boundary detection, a simple task that a compact model can handle well, with watershed segmentation, which inherently guarantees that no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC’s segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we design a token embedding that handles arbitrary-shaped tokens, and train VLMs with different tokenizers on 4 datasets covering object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
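
The boundary-then-watershed idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the EPOC implementation: the boundary map is hand-written rather than predicted by a model, and the helper names (`seed_markers`, `watershed`), the 0.5 threshold, and the seeding strategy (connected components of low-boundary pixels) are all assumptions made for illustration. The sketch shows why a marker-based flooding watershed assigns every pixel to some segment.

```python
import heapq
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity


def seed_markers(boundary, threshold=0.5):
    """Label connected regions of low boundary probability (hypothetical seeding)."""
    h, w = len(boundary), len(boundary[0])
    markers = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if boundary[y][x] < threshold and markers[y][x] == 0:
                next_label += 1
                markers[y][x] = next_label
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill of one seed region
                    cy, cx = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and markers[ny][nx] == 0
                                and boundary[ny][nx] < threshold):
                            markers[ny][nx] = next_label
                            queue.append((ny, nx))
    return markers


def watershed(boundary, markers):
    """Marker-based flooding: grow seeds outward, lowest boundary value first."""
    h, w = len(boundary), len(boundary[0])
    labels = [row[:] for row in markers]
    heap, counter = [], 0  # counter breaks ties deterministically
    for y in range(h):
        for x in range(w):
            if labels[y][x] > 0:
                heapq.heappush(heap, (boundary[y][x], counter, y, x))
                counter += 1
    while heap:
        _, _, y, x = heapq.heappop(heap)
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]  # unlabeled pixel joins the flooding segment
                heapq.heappush(heap, (boundary[ny][nx], counter, ny, nx))
                counter += 1
    return labels


# Toy boundary map: a vertical ridge of high boundary probability in column 3.
boundary = [[0.1, 0.1, 0.2, 0.9, 0.2, 0.1, 0.1] for _ in range(5)]
segments = watershed(boundary, seed_markers(boundary))
for row in segments:
    print(row)
```

In this toy example the ridge pixels are left unlabeled by the seeding step and are absorbed by an adjacent segment during flooding, so the resulting label map contains no zeros. On a connected grid with at least one seed, the flooding step reaches every pixel, which is the property the abstract refers to.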

Cite

Text

Chen et al. "Subobject-Level Image Tokenization." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Chen et al. "Subobject-Level Image Tokenization." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/chen2025icml-subobjectlevel/)

BibTeX

@inproceedings{chen2025icml-subobjectlevel,
  title     = {{Subobject-Level Image Tokenization}},
  author    = {Chen, Delong and Cahyawijaya, Samuel and Liu, Jianfeng and Wang, Baoyuan and Fung, Pascale},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {7719-7738},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/chen2025icml-subobjectlevel/}
}