Read and Attend: Temporal Localisation in Sign Language Videos
Abstract
The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.
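To make the localisation idea concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' released code: a toy encoder-decoder in which written-token queries cross-attend over continuous video features, and the resulting attention weights are read out as a temporal localisation signal for each token. All module names, dimensions, and the single cross-attention layer are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's implementation): a tiny video-to-token
# Transformer whose decoder cross-attention is used to localise signs in time.
import torch
import torch.nn as nn

class TinySignTransformer(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, vocab_size=1000, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)              # project video features
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # written-token embeddings
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, token_ids):
        # video_feats: (B, T, feat_dim) features of the continuous signing stream
        # token_ids:   (B, L) subtitle tokens (teacher forcing during training)
        memory = self.encoder(self.proj(video_feats))
        queries = self.tok_emb(token_ids)
        # attn: (B, L, T) -- each output token attends over the video time steps
        ctx, attn = self.cross_attn(queries, memory, memory, need_weights=True)
        return self.out(ctx), attn

model = TinySignTransformer()
video = torch.randn(1, 100, 512)              # 100 video time steps (illustrative)
tokens = torch.randint(0, 1000, (1, 8))       # 8 subtitle tokens (illustrative)
logits, attn = model(video, tokens)
# Localise the sign for token position 3: the time step it attends to most strongly.
frame_idx = attn[0, 3].argmax().item()
print(f"token 3 is localised around time step {frame_idx}")
```

In this sketch the cross-attention peak stands in for the localisation cue described in the abstract; in practice the model would be trained on the weakly-aligned subtitle data before its attention becomes a reliable annotation source.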
Cite
Text
Varol et al. "Read and Attend: Temporal Localisation in Sign Language Videos." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01658
Markdown
[Varol et al. "Read and Attend: Temporal Localisation in Sign Language Videos." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/varol2021cvpr-read/) doi:10.1109/CVPR46437.2021.01658
BibTeX
@inproceedings{varol2021cvpr-read,
title = {{Read and Attend: Temporal Localisation in Sign Language Videos}},
author = {Varol, Gul and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {16857-16866},
doi = {10.1109/CVPR46437.2021.01658},
url = {https://mlanthology.org/cvpr/2021/varol2021cvpr-read/}
}