Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos
Abstract
Video semantic segmentation (VSS) is a computationally expensive task because it requires per-frame prediction on videos with high frame rates. Recent work has proposed compact models or adaptive network strategies for efficient VSS. However, these approaches did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with a local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg reduces computational cost (measured in GFLOPs) by 67% with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
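To make the cost argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's accounting. It assumes that backbone cost scales with the number of input pixels, that each group of frames contains one full-resolution keyframe, and that the cost of the fusion module is negligible. The function name, the group length, and the scale factor are hypothetical values chosen for illustration and are not taken from the paper.

```python
def relative_cost(group_length: int, scale: float) -> float:
    """Average per-frame cost relative to running every frame at full resolution.

    Assumptions (illustrative only):
      - cost is proportional to pixel count, so a frame downsampled by
        `scale` along each side costs scale**2 of a full-resolution frame;
      - each group of `group_length` frames has exactly one keyframe;
      - any overhead from feature warping and fusion is ignored.
    """
    if group_length < 1:
        raise ValueError("group_length must be at least 1")
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    keyframe_cost = 1.0
    non_keyframe_cost = (group_length - 1) * scale ** 2
    return (keyframe_cost + non_keyframe_cost) / group_length


# Hypothetical setting: groups of 4 frames, non-keyframes at half resolution.
cost = relative_cost(group_length=4, scale=0.5)
saving = 1.0 - cost
print(f"relative cost: {cost:.4f}, estimated saving: {saving:.2%}")
```

Under these assumptions the saving grows with longer groups and smaller scale factors, which is why recovering accuracy on the low-resolution non-keyframes matters. The figure reported in the abstract comes from measured GFLOPs and should not be expected to match this estimate.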
Cite
Text
Hu et al. "Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02167
Markdown
[Hu et al. "Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/hu2023cvpr-efficient/) doi:10.1109/CVPR52729.2023.02167
BibTeX
@inproceedings{hu2023cvpr-efficient,
title = {{Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos}},
author = {Hu, Yubin and He, Yuze and Li, Yanghao and Li, Jisheng and Han, Yuxing and Wen, Jiangtao and Liu, Yong-Jin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {22627-22637},
doi = {10.1109/CVPR52729.2023.02167},
url = {https://mlanthology.org/cvpr/2023/hu2023cvpr-efficient/}
}