Self-Supervised ControlNet with Spatio-Temporal Mamba for Real-World Video Super-Resolution

Abstract

Existing diffusion-based video super-resolution (VSR) methods are prone to introducing complex degradations and noticeable artifacts into high-resolution (HR) videos because of their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework that incorporates self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from low-resolution (LR) videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
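
The abstract does not specify the contrastive objective, so the following is only a minimal sketch of the general idea, assuming a standard InfoNCE loss: features of two differently degraded views of the same clip are pulled together, while features of other clips are pushed apart. The names `info_nce` and `degrade`, the Gaussian noise standing in for real-world degradation, the feature sizes, and the temperature are all illustrative assumptions, not details from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: anchors[i] should match positives[i].

    Hypothetical helper for illustration; not the loss defined in the paper.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true match (index i).
        total += log_denom - logits[i]
    return total / len(a)


def degrade(v, sigma=0.05):
    """Assumed stand-in for a degradation: additive Gaussian noise."""
    return [x + random.gauss(0.0, sigma) for x in v]


random.seed(0)
# Stand-in features for 8 clips, 16 dimensions each.
clean = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
view_a = [degrade(v) for v in clean]  # first degraded view of each clip
view_b = [degrade(v) for v in clean]  # second degraded view of each clip

aligned = info_nce(view_a, view_b)         # views paired with the same clip
shuffled = info_nce(view_a, view_b[::-1])  # views paired with the wrong clip
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

The loss is low when the two views of each clip map to nearby features and high when the pairing is wrong, which is the property an encoder trained with such an objective would exploit to become insensitive to the degradation itself.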

Cite

Text

Shi et al. "Self-Supervised ControlNet with Spatio-Temporal Mamba for Real-World Video Super-Resolution." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00692

Markdown

[Shi et al. "Self-Supervised ControlNet with Spatio-Temporal Mamba for Real-World Video Super-Resolution." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/shi2025cvpr-selfsupervised/) doi:10.1109/CVPR52734.2025.00692

BibTeX

@inproceedings{shi2025cvpr-selfsupervised,
  title     = {{Self-Supervised ControlNet with Spatio-Temporal Mamba for Real-World Video Super-Resolution}},
  author    = {Shi, Shijun and Xu, Jing and Lu, Lijing and Li, Zhihang and Hu, Kai},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {7385-7395},
  doi       = {10.1109/CVPR52734.2025.00692},
  url       = {https://mlanthology.org/cvpr/2025/shi2025cvpr-selfsupervised/}
}