All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
Abstract
We introduce AiT, a unified output representation for various vision tasks, which is a crucial step towards general-purpose vision task solvers. Despite the challenges posed by high-dimensional and task-specific outputs, we showcase the potential of using a discrete representation (VQ-VAE) to model the dense outputs of many computer vision tasks as a sequence of discrete tokens. This is inspired by the established ability of VQ-VAE to preserve structures spanning multiple pixels with only a few discrete codes. To that end, we present a modified, shallower VQ-VAE architecture that improves efficiency while maintaining prediction accuracy. Our approach also incorporates uncertainty into the decoding process through a soft fusion of the codebook entries, yielding a more stable training process and notably improved prediction accuracy. Our evaluation of AiT on depth estimation and instance segmentation, tasks with continuous and discrete labels respectively, demonstrates its superiority over other unified models. The code and models are available at https://github.com/SwinTransformer/AiT.
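To make the soft-token idea concrete, the sketch below contrasts standard hard decoding (feeding the argmax codebook entry to the decoder) with the soft fusion described in the abstract, where the decoder input is a probability-weighted mixture of all codebook embeddings. This is a minimal illustration, not the authors' implementation; all tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: B predicted tokens, V codebook entries, D embedding dim.
B, V, D = 4, 128, 256
logits = torch.randn(B, V)       # task network's per-token logits over the codebook
codebook = torch.randn(V, D)     # VQ-VAE codebook embeddings

# Hard token (standard VQ-VAE decoding): select one codebook entry per token.
hard_ids = logits.argmax(dim=-1)        # (B,)
hard_embed = codebook[hard_ids]         # (B, D)

# Soft token: fuse all codebook entries weighted by the predicted distribution,
# so the decoder input reflects the model's uncertainty over codes.
probs = F.softmax(logits, dim=-1)       # (B, V)
soft_embed = probs @ codebook           # (B, D)
```

Because the soft embedding is a smooth function of the logits, gradients flow through every codebook entry rather than a single selected code, which is consistent with the stability benefit the abstract attributes to soft tokens.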
Cite
Text
Ning et al. "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01822
Markdown
[Ning et al. "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/ning2023iccv-all/) doi:10.1109/ICCV51070.2023.01822
BibTeX
@inproceedings{ning2023iccv-all,
title = {{All in Tokens: Unifying Output Space of Visual Tasks via Soft Token}},
author = {Ning, Jia and Li, Chen and Zhang, Zheng and Wang, Chunyu and Geng, Zigang and Dai, Qi and He, Kun and Hu, Han},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {19900--19910},
doi = {10.1109/ICCV51070.2023.01822},
url = {https://mlanthology.org/iccv/2023/ning2023iccv-all/}
}