Region-Native Visual Tokenization

Abstract

We explore a region-based visual token representation and present the REgion-native AutoencoDER (Reader). Unlike most previous methods, which represent each image as a grid-shaped token map, Reader converts each image into a sequence of region-based tokens, each corresponding to an object or a part of an object in the image. Reader comprises an encoder and a decoder. The encoder partitions each image into an adaptive number of arbitrarily shaped regions and encodes each region into a token. The decoder then reconstructs the original image from this adaptive-length token sequence. Experimental results show that this region-based token representation has two notable characteristics. First, it encodes images efficiently: Reader adaptively uses more regions for complex areas and fewer for simpler ones, thus avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods despite using significantly fewer tokens per image. Second, the region-based design enables manipulation of a local region without causing global changes. As a result, Reader inherently supports diverse image editing operations, including erasing, adding, replacing, and modifying the shapes of objects, and achieves excellent performance on the smile-transfer image editing benchmark. Code is provided at https://github.com/MengyuWang826/Reade
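
The two properties described in the abstract (one token per region, and local edits that stay local) can be illustrated with a toy sketch. This is not Reader's implementation: it assumes the region label map is already given, and it substitutes simple mean pooling for the learned encoder and a broadcast for the learned decoder. The names `encode_regions`, `decode_regions`, and the sample data are hypothetical.

```python
from typing import Dict, List

Image = List[List[float]]    # grayscale pixel intensities
LabelMap = List[List[int]]   # assumed-given region id for every pixel


def encode_regions(image: Image, labels: LabelMap) -> Dict[int, float]:
    """Toy encoder: one token per region, here the mean intensity of the region."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for pixel_row, label_row in zip(image, labels):
        for value, region in zip(pixel_row, label_row):
            sums[region] = sums.get(region, 0.0) + value
            counts[region] = counts.get(region, 0) + 1
    return {region: sums[region] / counts[region] for region in sums}


def decode_regions(tokens: Dict[int, float], labels: LabelMap) -> Image:
    """Toy decoder: paint every pixel with the token of its region."""
    return [[tokens[region] for region in label_row] for label_row in labels]


# A 2x3 image split into three arbitrarily shaped regions (ids 0, 1, 2).
image = [[0.25, 0.5, 1.0],
         [0.75, 0.5, 0.5]]
labels = [[0, 0, 1],
          [0, 2, 1]]

tokens = encode_regions(image, labels)   # token count equals region count
recon = decode_regions(tokens, labels)

# Edit a single token and decode again: only that region's pixels change.
edited_tokens = dict(tokens)
edited_tokens[2] = 0.0
edited = decode_regions(edited_tokens, labels)
changed = [(r, c)
           for r in range(len(labels))
           for c in range(len(labels[0]))
           if edited[r][c] != recon[r][c]]
```

In this sketch the sequence length is set by the number of regions rather than by the image resolution, and `changed` contains only the pixels of the edited region, which mirrors the locality argument made in the abstract.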

Cite

Text

Wang et al. "Region-Native Visual Tokenization." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72904-1_2

Markdown

[Wang et al. "Region-Native Visual Tokenization." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-regionnative/) doi:10.1007/978-3-031-72904-1_2

BibTeX

@inproceedings{wang2024eccv-regionnative,
  title     = {{Region-Native Visual Tokenization}},
  author    = {Wang, Mengyu and Huang, Yuyao and Ding, Henghui and Wang, Xinlong and Huang, Tiejun and Zhao, Yao and Wei, Yunchao and Yan, Shuicheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72904-1_2},
  url       = {https://mlanthology.org/eccv/2024/wang2024eccv-regionnative/}
}