S3-Net: A Fast and Lightweight Video Scene Understanding Network by Single-Shot Segmentation

Abstract

Real-time video understanding is crucial in various AI applications such as autonomous driving. This work presents a fast single-shot segmentation strategy for video scene understanding. The proposed network, called S3-Net, quickly locates and segments target sub-scenes while simultaneously extracting structured time-series semantic features as inputs to an LSTM-based spatio-temporal model. Through tensorization and quantization techniques, S3-Net is designed to be lightweight enough for edge computing. Experiments on the CityScapes, UCF11, HMDB51, and MOMENTS datasets demonstrate that S3-Net achieves an 8.1% accuracy improvement over the 3D-CNN based approach on UCF11, a 6.9x storage reduction, and an inference speed of 22.8 FPS on CityScapes with a GTX 1080 Ti GPU.
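
To make the described pipeline concrete (per-frame semantic features fed as a time series into an LSTM-based classifier), the sketch below is a minimal PyTorch illustration. Every module name, layer size, and the final quantize_dynamic call are assumptions for illustration only; they do not reproduce the authors' architecture or the paper's tensorization/quantization scheme.

import torch
import torch.nn as nn

class SceneLSTM(nn.Module):
    """Hypothetical sketch of the abstract's pipeline: a stand-in
    per-frame backbone produces semantic feature vectors, which an
    LSTM consumes as a time series for clip-level classification."""
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=11):
        super().__init__()
        # Stand-in for the single-shot segmentation backbone that
        # would emit structured semantic features per frame.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool the spatial map to one vector
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)               # (b*t, C, H, W)
        feats = self.backbone(frames).flatten(1)  # (b*t, feat_dim)
        feats = feats.view(b, t, -1)              # time-series features
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])              # classify from last step

model = SceneLSTM()  # num_classes=11 matches UCF11's 11 action classes
logits = model(torch.randn(2, 8, 3, 64, 64))  # 2 clips of 8 frames each
print(logits.shape)  # torch.Size([2, 11])

# Generic post-training dynamic quantization as a lightweight-model
# illustration (a stock PyTorch utility, not the paper's method):
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)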

Cite

Text

Cheng et al. "S3-Net: A Fast and Lightweight Video Scene Understanding Network by Single-Shot Segmentation." Winter Conference on Applications of Computer Vision, 2021.

Markdown

[Cheng et al. "S3-Net: A Fast and Lightweight Video Scene Understanding Network by Single-Shot Segmentation." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/cheng2021wacv-s3net/)

BibTeX

@inproceedings{cheng2021wacv-s3net,
  title     = {{S3-Net: A Fast and Lightweight Video Scene Understanding Network by Single-Shot Segmentation}},
  author    = {Cheng, Yuan and Yang, Yuchao and Chen, Hai-Bao and Wong, Ngai and Yu, Hao},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2021},
  pages     = {3329--3337},
  url       = {https://mlanthology.org/wacv/2021/cheng2021wacv-s3net/}
}