Region-Native Visual Tokenization
Abstract
We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER (Reader). In contrast to the majority of previous methods, which represent each image as a grid-shaped tokens map, Reader perceives each image into sequential region-based tokens, with each token corresponding to an object or one part of an object in the image. Specifically, Reader comprises both an encoder and a decoder. The encoder can partition each image into an adaptive number of arbitrary-shaped regions and encode each region into a token. Subsequently, the decoder utilizes this adaptive-length token sequence to reconstruct the original image. Experimental results demonstrate that such region-based token representation possesses two main notable characteristics. Firstly, it achieves highly efficient image encoding. Reader can adaptively use more regions to represent complex areas and fewer regions in simpler ones, thus avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods, despite using significantly fewer tokens for each image. Secondly, the region-based manner enables manipulation on a local region without causing global changes. As a result, Reader inherently supports diverse image editing operations, including erasing, adding, replacing, and modifying shapes on the objects, and achieves excellent performance in the image editing benchmark of smile transferring. Code is provided at https://github.com/MengyuWang826/Reade
Cite
Text
Wang et al. "Region-Native Visual Tokenization." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72904-1_2Markdown
[Wang et al. "Region-Native Visual Tokenization." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-regionnative/) doi:10.1007/978-3-031-72904-1_2BibTeX
@inproceedings{wang2024eccv-regionnative,
title = {{Region-Native Visual Tokenization}},
author = {Wang, Mengyu and Huang, Yuyao and Ding, Henghui and Wang, Xinlong and Huang, Tiejun and Zhao, Yao and Wei, Yunchao and Yan, Shuicheng},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72904-1_2},
url = {https://mlanthology.org/eccv/2024/wang2024eccv-regionnative/}
}