PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding
Abstract
Predictive learning models, which aim to predict future frames based on past observations, are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction is a viable strategy for addressing these needs. However, improving token representation and optimizing token decoding remain significant challenges. This paper introduces PredToken, a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely, we first design a "decomposition, quantization, and reconstruction" scheme based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model, allowing more low-level details to be preserved. Building on this, we present a "coarse-to-fine iterative decoding" method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens, enabling more high-level dynamics to be captured. Together, these designs enable PredToken to produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore, PredToken can be extended to other visual generative tasks to yield realistic outcomes.
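To make the "decomposition, quantization, and reconstruction" idea more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a box blur as the low-pass filter, plain nearest-neighbour vector quantization, and random codebooks. The names `box_blur`, `quantize`, `cb_low`, and `cb_high` are hypothetical, and the dimension-aware quantization model and the VQGAN encoder and decoder from the paper are not reproduced here.

```python
import numpy as np

def box_blur(x, k=3):
    """Assumed low-pass filter: mean over a k x k window with edge padding.
    x has shape (H, W, C)."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    H, W, _ = x.shape
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + H, dx:dx + W]
    return out / (k * k)

def quantize(z, codebook):
    """Nearest-neighbour vector quantization (a stand-in for the paper's quantizer).
    z: (H, W, C); codebook: (K, C). Returns token indices and quantized vectors."""
    flat = z.reshape(-1, z.shape[-1])
    # Squared Euclidean distance from every feature vector to every code.
    dist = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dist.argmin(1)
    return idx.reshape(z.shape[:2]), codebook[idx].reshape(z.shape)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 4))        # toy feature map, not real encoder output

# Decomposition: low-frequency part plus high-frequency residual.
low = box_blur(feat)
high = feat - low

# Quantization: one (random, untrained) codebook per component.
cb_low = rng.normal(size=(16, 4))
cb_high = rng.normal(size=(16, 4))
tok_low, q_low = quantize(low, cb_low)     # coarse tokens
tok_high, q_high = quantize(high, cb_high) # fine tokens

# Reconstruction: recombine the two quantized components.
recon = q_low + q_high
```

In this sketch, `tok_low` and `tok_high` play the role of the coarse and fine token maps that a predictor would decode; with trained codebooks, `recon` would approximate `feat`.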
Cite
Text
Nie et al. "PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01718
Markdown
[Nie et al. "PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/nie2024cvpr-predtoken/) doi:10.1109/CVPR52733.2024.01718
BibTeX
@inproceedings{nie2024cvpr-predtoken,
title = {{PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding}},
author = {Nie, Xuesong and Jin, Haoyuan and Yan, Yunfeng and Chen, Xi and Zhu, Zhihang and Qi, Donglian},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {18143-18152},
doi = {10.1109/CVPR52733.2024.01718},
url = {https://mlanthology.org/cvpr/2024/nie2024cvpr-predtoken/}
}