PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Abstract

Predictive learning models, which aim to predict future frames based on past observations, are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction presents a viable strategy for addressing these needs. However, improving token representation and optimizing token decoding remain significant challenges. This paper introduces PredToken, a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely, we first design a "decomposition, quantization, and reconstruction" scheme based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model, allowing more low-level details to be preserved. Building on this, we present a "coarse-to-fine iterative decoding" method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens, enabling more high-level dynamics to be captured. These designs enable PredToken to produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore, PredToken can also be extended to other visual generative tasks to yield realistic outcomes.
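
As a rough illustration of the "decomposition, quantization, and reconstruction" idea, the sketch below splits a frame into a low-frequency part and a high-frequency residual, then maps feature vectors to their nearest codebook entries. It is not the authors' implementation: the block-average low-pass filter, the plain nearest-neighbour quantizer, and the names `split_frequencies`, `quantize`, and `k` are all assumptions made for this example. The abstract does not specify how PredToken performs the decomposition or its dimension-aware quantization.

```python
import numpy as np


def split_frequencies(frame, k=4):
    """Split a 2D frame into low- and high-frequency parts.

    Assumption: a k-by-k block average stands in for the low-pass filter;
    the high-frequency part is simply the residual.
    """
    h, w = frame.shape
    if h % k or w % k:
        raise ValueError("frame height and width must be divisible by k")
    # Average every k-by-k block, then upsample back to full resolution.
    coarse = frame.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    low = np.repeat(np.repeat(coarse, k, axis=0), k, axis=1)
    high = frame - low
    return low, high


def quantize(vectors, codebook):
    """Nearest-neighbour vector quantization (hypothetical helper).

    Returns the index of the closest codebook entry for each vector,
    together with the quantized vectors themselves.
    """
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


rng = np.random.default_rng(0)
frame = rng.random((8, 8))
low, high = split_frequencies(frame, k=4)

# Toy codebook with two entries; real codebooks are learned.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
vectors = np.array([[0.1, 0.2], [0.9, 0.8]])
tokens, quantized = quantize(vectors, codebook)
```

In this sketch the two parts sum back to the original frame exactly, so any reconstruction loss would come only from quantization. A coarse-to-fine decoder in the spirit of the abstract would predict tokens for the low-frequency part first and then refine with the high-frequency tokens.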

Cite

Text

Nie et al. "PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01718

Markdown

[Nie et al. "PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/nie2024cvpr-predtoken/) doi:10.1109/CVPR52733.2024.01718

BibTeX

@inproceedings{nie2024cvpr-predtoken,
  title     = {{PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding}},
  author    = {Nie, Xuesong and Jin, Haoyuan and Yan, Yunfeng and Chen, Xi and Zhu, Zhihang and Qi, Donglian},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18143-18152},
  doi       = {10.1109/CVPR52733.2024.01718},
  url       = {https://mlanthology.org/cvpr/2024/nie2024cvpr-predtoken/}
}