Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning

Abstract

Existing unsupervised video-to-video translation methods fail to produce translated videos that are frame-wise realistic, semantics-preserving, and video-level consistent. In this work, we propose a novel unsupervised video-to-video translation model. Our model decomposes style and content using a specialized encoder-decoder structure and propagates inter-frame information through bidirectional recurrent neural network (RNN) units. The style-content decomposition mechanism enables style-consistent video translation results and provides a convenient interface for modality-flexible translation. In addition, by varying the input frames and style codes fed into the translation, we propose a video interpolation loss that captures temporal information within the sequence and trains our building blocks in a self-supervised manner. Our model produces photo-realistic, spatio-temporally consistent translated videos in a multimodal way. Subjective and objective experimental results validate the superiority of our model over existing methods.
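To make the described architecture concrete, below is a minimal sketch (not the authors' released code) of a style-content encoder-decoder whose per-frame content codes are temporally propagated by a bidirectional RNN, as outlined in the abstract. All module names, layer choices, and dimensions are illustrative assumptions; the style code is injected here by simple concatenation rather than whatever conditioning the paper actually uses.

```python
# Hypothetical sketch of the abstract's architecture; all design details are assumptions.
import torch
import torch.nn as nn


class VideoTranslator(nn.Module):
    def __init__(self, content_dim=256, style_dim=8):
        super().__init__()
        # Content encoder: per-frame spatial features (content code).
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.Conv2d(64, content_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Style encoder: a global style code per reference image.
        self.style_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, style_dim),
        )
        # Bidirectional RNN over pooled per-frame content codes,
        # propagating inter-frame (temporal) information.
        self.temporal = nn.GRU(content_dim, content_dim // 2,
                               batch_first=True, bidirectional=True)
        # Decoder: reconstructs each frame from content features
        # concatenated with the broadcast style code.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(content_dim + style_dim, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 3, 7, stride=1, padding=3), nn.Tanh(),
        )

    def forward(self, frames, style_image):
        # frames: (B, T, 3, H, W); style_image: (B, 3, H, W)
        b, t, c, h, w = frames.shape
        content = self.content_enc(frames.reshape(b * t, c, h, w))
        _, cd, hh, ww = content.shape
        # Temporal propagation over spatially pooled content codes.
        pooled = content.mean(dim=(2, 3)).reshape(b, t, cd)
        temporal, _ = self.temporal(pooled)
        content = content + temporal.reshape(b * t, cd, 1, 1)
        # Broadcast the style code over frames and spatial positions.
        style = self.style_enc(style_image)
        style = style[:, None, :].expand(b, t, -1).reshape(b * t, -1)
        style = style[:, :, None, None].expand(-1, -1, hh, ww)
        out = self.decoder(torch.cat([content, style], dim=1))
        return out.reshape(b, t, c, h, w)


if __name__ == "__main__":
    model = VideoTranslator()
    video = torch.randn(1, 4, 3, 64, 64)   # a 4-frame clip
    style = torch.randn(1, 3, 64, 64)       # reference image in the target domain
    print(model(video, style).shape)        # torch.Size([1, 4, 3, 64, 64])
```

Swapping the style code at inference time while keeping the content codes fixed is what would give the multimodal, style-consistent translations described above; the self-supervised video interpolation loss would additionally supervise the temporal module by reconstructing held-out frames from their neighbors.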

Cite

Text

Liu et al. "Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning." Winter Conference on Applications of Computer Vision, 2021.

Markdown

[Liu et al. "Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/liu2021wacv-unsupervised/)

BibTeX

@inproceedings{liu2021wacv-unsupervised,
  title     = {{Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning}},
  author    = {Liu, Kangning and Gu, Shuhang and Romero, Andres and Timofte, Radu},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2021},
  pages     = {1030--1040},
  url       = {https://mlanthology.org/wacv/2021/liu2021wacv-unsupervised/}
}