Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
Abstract
This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static and dynamic variables, and that improves model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground-truth factors to be identifiable, and to the introduction of a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.
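To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' code or exact architecture) of a conditional prior in which the per-frame dynamic codes depend on the sequence-level static code, rather than being assumed independent of it. The class and dimension names are illustrative assumptions, and the additional Normalizing Flows used in the paper for a more expressive prior are omitted here.

```python
# Hypothetical sketch: a conditional prior p(z_t | z_{<t}, s) where the
# dynamics z_t are conditioned on the static (content) code s, instead of
# assuming static and dynamic factors are independent.
import torch
import torch.nn as nn

class ConditionalStaticDynamicPrior(nn.Module):
    """Recurrent prior over dynamic codes, conditioned on the static code."""
    def __init__(self, static_dim=16, dynamic_dim=8, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(dynamic_dim + static_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, dynamic_dim)
        self.to_logvar = nn.Linear(hidden, dynamic_dim)

    def forward(self, z_prev, s):
        # z_prev: (B, T, dynamic_dim) past dynamic codes; s: (B, static_dim)
        s_rep = s.unsqueeze(1).expand(-1, z_prev.size(1), -1)
        h, _ = self.rnn(torch.cat([z_prev, s_rep], dim=-1))
        return self.to_mu(h), self.to_logvar(h)

if __name__ == "__main__":
    prior = ConditionalStaticDynamicPrior()
    z_prev = torch.zeros(2, 5, 8)   # batch of 2 sequences, 5 timesteps
    s = torch.randn(2, 16)          # one static code per sequence
    mu, logvar = prior(z_prev, s)
    # Reparameterized sample of the next dynamic codes
    z_next = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    print(z_next.shape)             # torch.Size([2, 5, 8])
```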
Cite
Text
Simon et al. "Sequential Representation Learning via Static-Dynamic Conditional Disentanglement." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73226-3_7
Markdown
[Simon et al. "Sequential Representation Learning via Static-Dynamic Conditional Disentanglement." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/simon2024eccv-sequential/) doi:10.1007/978-3-031-73226-3_7
BibTeX
@inproceedings{simon2024eccv-sequential,
title = {{Sequential Representation Learning via Static-Dynamic Conditional Disentanglement}},
author = {Simon, Mathieu Cyrille and Frossard, Pascal and De Vleeschouwer, Christophe},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73226-3_7},
url = {https://mlanthology.org/eccv/2024/simon2024eccv-sequential/}
}