Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation
Abstract
This paper tackles the problem of efficient and stable video semantic segmentation. While stability has been under-explored, prevalent work on efficient video semantic segmentation follows the keyframe paradigm. These methods process videos efficiently by recomputing only the low-level features and reusing high-level features computed at selected keyframes. In addition, the reused features stabilize the predictions across frames, thereby improving video consistency. However, dynamic scenes in the video can easily lead to misalignments between reused and recomputed features, which hampers performance. Moreover, relying on feature reuse to improve prediction consistency is brittle; an erroneous alignment of the features can easily lead to unstable predictions. The keyframe paradigm therefore faces a dilemma between stability and performance. We address this efficiency and stability challenge with a novel yet simple Temporal Feature Correlation (TFC) module. It uses the cosine similarity between two frames’ low-level features to estimate how consistent the semantic labels are across frames. Specifically, we selectively reuse label-consistent features across frames through linear interpolation and update the others through sparse multi-scale deformable attention. As a result, we no longer directly reuse features to improve stability and thus effectively solve feature misalignment. This work provides a significant step towards efficient and stable video semantic segmentation. On the VSPW dataset, our method significantly improves the prediction consistency of image-based methods while remaining as fast and accurate.
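The correlation-and-interpolation idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `(C, H, W)` feature layout, and the use of the clipped similarity as the interpolation weight are all assumptions made for illustration, and the sparse multi-scale deformable attention step is replaced by a placeholder array of "updated" features.

```python
import numpy as np

def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-pixel cosine similarity between two (C, H, W) feature maps.

    Hypothetical helper: the abstract only states that cosine similarity
    between two frames' low-level features is used.
    """
    dot = (feat_a * feat_b).sum(axis=0)
    norm = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0)
    return dot / (norm + eps)

def blend_features(key_high, updated_high, similarity):
    """Linearly interpolate between reused and updated high-level features.

    Assumption: the similarity, clipped to [0, 1], is used directly as the
    per-pixel interpolation weight. High similarity keeps the keyframe
    features; low similarity falls back to the updated ones.
    """
    weight = np.clip(similarity, 0.0, 1.0)[None]  # shape (1, H, W)
    return weight * key_high + (1.0 - weight) * updated_high

# Toy example with random features standing in for network outputs.
rng = np.random.default_rng(0)
low_key = rng.normal(size=(4, 3, 3))      # low-level features, keyframe
low_cur = rng.normal(size=(4, 3, 3))      # low-level features, current frame
high_key = rng.normal(size=(8, 3, 3))     # high-level features, keyframe
high_updated = rng.normal(size=(8, 3, 3)) # placeholder for attention output

similarity = cosine_similarity_map(low_key, low_cur)
high_cur = blend_features(high_key, high_updated, similarity)
print(similarity.shape, high_cur.shape)
```

In this sketch, pixels whose low-level features barely change between frames keep most of the keyframe's high-level features, while pixels that change strongly rely on the updated features instead.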
Cite
Text
Lin et al. "Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I4.28132
Markdown
[Lin et al. "Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/lin2024aaai-exploring/) doi:10.1609/AAAI.V38I4.28132
BibTeX
@inproceedings{lin2024aaai-exploring,
title = {{Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation}},
author = {Lin, Matthieu and Sheng, Jenny and Hu, Yubin and Li, Yangguang and Qi, Lu and Zhao, Andrew and Huang, Gao and Liu, Yong-Jin},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2024},
pages = {3450-3458},
doi = {10.1609/AAAI.V38I4.28132},
url = {https://mlanthology.org/aaai/2024/lin2024aaai-exploring/}
}