Combining Frame and GOP Embeddings for Neural Video Representation

Abstract

Implicit neural representations (INRs) were recently proposed as a new video compression paradigm, with existing approaches performing on par with HEVC. However, such methods only perform well in limited settings, e.g., specific model sizes, fixed aspect ratios, and low-motion videos. We address this issue by proposing T-NeRV, a hybrid video INR that combines frame-specific embeddings with GOP-specific features, providing a lever for content-specific fine-tuning. We employ entropy-constrained training to jointly optimize our model for rate and distortion, and demonstrate that T-NeRV can thereby automatically adjust this lever during training, effectively fine-tuning itself to the target content. We evaluate T-NeRV on the UVG dataset, where it achieves state-of-the-art results on the video representation task, outperforming previous works by up to 3 dB PSNR on challenging high-motion sequences. Further, our method improves on the compression performance of previous methods and is the first video INR to outperform HEVC on all UVG sequences.
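
To make the two quantities in the abstract concrete, the sketch below computes PSNR and a generic rate-distortion objective of the form D + λR. It is an illustration under stated assumptions, not the T-NeRV training code: the function names, the use of MSE as the distortion term, the empirical Shannon entropy of quantized weights as the rate term, and the weight `lam` are all hypothetical choices made for this example. Entropy-constrained training in practice relies on a differentiable entropy estimate, which this plain-Python version does not provide.

```python
import math
from collections import Counter


def mse(reference, reconstruction):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / err)


def empirical_entropy_bits(symbols):
    """Total bits needed to code `symbols` under their empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())


def rate_distortion_loss(reference, reconstruction, quantized_weights, lam):
    """Hypothetical objective D + lam * R, with R measured in bits per pixel."""
    distortion = mse(reference, reconstruction)
    rate = empirical_entropy_bits(quantized_weights) / len(reference)
    return distortion + lam * rate
```

Because PSNR is logarithmic in the MSE, a gain of 3 dB corresponds to roughly halving the mean squared error.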

Cite

Text

Saethre et al. "Combining Frame and GOP Embeddings for Neural Video Representation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00884

Markdown

[Saethre et al. "Combining Frame and GOP Embeddings for Neural Video Representation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/saethre2024cvpr-combining/) doi:10.1109/CVPR52733.2024.00884

BibTeX

@inproceedings{saethre2024cvpr-combining,
  title     = {{Combining Frame and GOP Embeddings for Neural Video Representation}},
  author    = {Saethre, Jens Eirik and Azevedo, Roberto and Schroers, Christopher},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {9253-9263},
  doi       = {10.1109/CVPR52733.2024.00884},
  url       = {https://mlanthology.org/cvpr/2024/saethre2024cvpr-combining/}
}