Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

Abstract

We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities to generate diverse, temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any fine-tuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
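The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-frame query and target embeddings are already available (random vectors stand in for the learned representations here), turns their cosine similarities into transition probabilities with a temperature-scaled softmax, and then samples the next frame only among high-probability candidates. The function names, the `temperature` and `keep_ratio` parameters, and the choice to forbid a frame from transitioning to itself are all assumptions made for illustration.

```python
import numpy as np


def transition_probabilities(query, target, temperature=0.1):
    """Row-wise softmax over cosine similarities (illustrative assumption).

    query[i]  : embedding of frame i as the current frame
    target[j] : embedding of frame j as a candidate next frame
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature
    # Assumption: a frame may not transition to itself (avoids frozen output).
    np.fill_diagonal(logits, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def synthesize(probs, start, length, keep_ratio=0.5, rng=None):
    """Sample a frame ordering, keeping only high-probability transitions."""
    rng = np.random.default_rng() if rng is None else rng
    order = [start]
    for _ in range(length - 1):
        p = probs[order[-1]].copy()
        # Discard candidates far below the best one; the best always survives.
        p[p < keep_ratio * p.max()] = 0.0
        p /= p.sum()
        order.append(int(rng.choice(len(p), p=p)))
    return order


# Random embeddings stand in for the learned, video-specific representations.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 16))
target = rng.normal(size=(8, 16))
probs = transition_probabilities(query, target)
order = synthesize(probs, start=0, length=20, rng=rng)
print(order)
```

Raising `keep_ratio` toward 1 makes the output follow the single most likely transition at each step, while lowering it admits more varied (and potentially less smooth) sequences.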

Cite

Text

Narasimhan et al. "Strumming to the Beat: Audio-Conditioned Contrastive Video Textures." Winter Conference on Applications of Computer Vision, 2022.

Markdown

[Narasimhan et al. "Strumming to the Beat: Audio-Conditioned Contrastive Video Textures." Winter Conference on Applications of Computer Vision, 2022.](https://mlanthology.org/wacv/2022/narasimhan2022wacv-strumming/)

BibTeX

@inproceedings{narasimhan2022wacv-strumming,
  title     = {{Strumming to the Beat: Audio-Conditioned Contrastive Video Textures}},
  author    = {Narasimhan, Medhini and Ginosar, Shiry and Owens, Andrew and Efros, Alexei A. and Darrell, Trevor},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2022},
  pages     = {3761--3770},
  url       = {https://mlanthology.org/wacv/2022/narasimhan2022wacv-strumming/}
}