Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

Abstract

We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions, and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
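
To make the decoding idea concrete, here is a minimal NumPy sketch of projected mirror descent on a fused Gromov-Wasserstein-style objective. It is an illustration under stated assumptions, not the authors' implementation. The function names (`temporal_prior`, `decode_segmentation`), the parameters (`radius`, `alpha`, `step`, `n_iters`), and the two structure matrices are all hypothetical choices made for this example. The class marginal is left entirely unconstrained as a simple stand-in for the unbalanced formulation, and each frame's row is normalized to sum to 1.

```python
import numpy as np


def temporal_prior(n_frames, radius):
    """Assumed frame structure matrix: 1/radius for frames 1..radius apart, else 0."""
    idx = np.arange(n_frames)
    gap = np.abs(idx[:, None] - idx[None, :])
    return ((gap >= 1) & (gap <= radius)).astype(float) / radius


def decode_segmentation(cost, radius=3, alpha=0.5, step=1.0, n_iters=30):
    """Decode frame labels from a (frames x classes) cost matrix.

    Minimizes  alpha * <A @ T @ B, T> + (1 - alpha) * <cost, T>
    over soft assignments T whose rows each sum to 1, where A penalizes
    temporally close frames and B penalizes differing classes.
    """
    n_frames, n_classes = cost.shape
    frame_struct = temporal_prior(n_frames, radius)  # A, symmetric
    class_struct = 1.0 - np.eye(n_classes)           # B, symmetric

    # Start from the uniform assignment; iterate in the log domain for stability.
    log_t = np.full((n_frames, n_classes), -np.log(n_classes))
    for _ in range(n_iters):
        t = np.exp(log_t)
        # Gradient of the objective; the factor 2 relies on A and B being symmetric.
        grad = 2.0 * alpha * frame_struct @ t @ class_struct + (1.0 - alpha) * cost
        # Mirror step with the entropy mirror map (multiplicative update).
        log_t = log_t - step * grad
        # KL projection onto the row-sum constraint: renormalize each row.
        row_max = log_t.max(axis=1, keepdims=True)
        log_norm = row_max + np.log(np.exp(log_t - row_max).sum(axis=1, keepdims=True))
        log_t = log_t - log_norm

    assignment = np.exp(log_t)
    return assignment.argmax(axis=1), assignment
```

Calling `labels, assignment = decode_segmentation(cost)` on a frames-by-classes cost array returns one class index per frame together with the soft assignment. Because the quadratic term only couples frames that are close in time, no action ordering has to be supplied. Every step is a dense matrix product followed by a row-wise normalization, which is the kind of computation that maps well onto a GPU.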

Cite

Text

Xu and Gould. "Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01385

Markdown

[Xu and Gould. "Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/xu2024cvpr-temporally/) doi:10.1109/CVPR52733.2024.01385

BibTeX

@inproceedings{xu2024cvpr-temporally,
  title     = {{Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation}},
  author    = {Xu, Ming and Gould, Stephen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14618--14627},
  doi       = {10.1109/CVPR52733.2024.01385},
  url       = {https://mlanthology.org/cvpr/2024/xu2024cvpr-temporally/}
}