Understanding Multi-Task Activities from Single-Task Videos
Abstract
We introduce multi-task temporal action segmentation (MT-TAS), a novel paradigm that addresses the challenge of interleaved actions when multiple tasks are performed simultaneously. Traditional action segmentation models, trained on single-task videos, struggle to handle the task switches and complex scenes inherent in multi-task scenarios. To overcome these challenges, our MT-TAS approach synthesizes multi-task video data from single-task sources using Multi-task Sequence Blending and Segment Boundary Learning modules. Additionally, we propose to dynamically isolate foreground and background elements within video frames, addressing the intricate object layouts of multi-task scenes and enabling a new two-stage temporal action segmentation framework with Foreground-Aware Action Refinement. Finally, we introduce the Multi-task Egocentric Kitchen Activities (MEKA) dataset, containing 12 hours of egocentric multi-task videos, to rigorously benchmark MT-TAS models. Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments.
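To make the idea of synthesizing multi-task sequences from single-task sources concrete, the sketch below illustrates one possible form of sequence blending: two single-task action sequences are interleaved with random task switches to produce a synthetic multi-task sequence. This is an illustrative assumption only; the function, the `switch_prob` parameter, and the `tea`/`toast` examples are hypothetical and do not reproduce the authors' Multi-task Sequence Blending module.

```python
import random
from typing import List, Tuple

# A single-task video is represented as a list of (action_label, num_frames) segments.
Segment = Tuple[str, int]

def blend_sequences(task_a: List[Segment], task_b: List[Segment],
                    switch_prob: float = 0.5, seed: int = 0) -> List[Segment]:
    """Interleave segments of two single-task videos to emulate task switching.

    At each step, continue with the current task or switch to the other one with
    probability `switch_prob`, so the output contains interleaved actions from
    both tasks. Hypothetical sketch, not the paper's implementation.
    """
    rng = random.Random(seed)
    queues = {"A": list(task_a), "B": list(task_b)}
    current = "A"
    blended: List[Segment] = []
    while queues["A"] or queues["B"]:
        other = "B" if current == "A" else "A"
        # Switch only if the other task still has segments left; force a switch
        # when the current task is exhausted.
        if queues[other] and (not queues[current] or rng.random() < switch_prob):
            current = other
        blended.append(queues[current].pop(0))
    return blended

# Example: two short single-task activities blended into one multi-task sequence.
tea = [("boil water", 120), ("steep tea", 300), ("pour tea", 60)]
toast = [("slice bread", 90), ("toast bread", 240), ("spread butter", 80)]
print(blend_sequences(tea, toast))
```

The per-frame labels of such a blended sequence could then serve as supervision for a segmentation model expected to handle task switches at test time.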
Cite
Text
Shen and Elhamifar. "Understanding Multi-Task Activities from Single-Task Videos." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01781
Markdown
[Shen and Elhamifar. "Understanding Multi-Task Activities from Single-Task Videos." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/shen2025cvpr-understanding/) doi:10.1109/CVPR52734.2025.01781
BibTeX
@inproceedings{shen2025cvpr-understanding,
title = {{Understanding Multi-Task Activities from Single-Task Videos}},
author = {Shen, Yuhan and Elhamifar, Ehsan},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
  pages = {19120--19131},
doi = {10.1109/CVPR52734.2025.01781},
url = {https://mlanthology.org/cvpr/2025/shen2025cvpr-understanding/}
}