Neural Sign Language Synthesis: Words Are Our Glosses
Abstract
This paper deals with text-to-video sign language synthesis. Instead of producing video directly, we focus on producing skeletal models. Our main goal was to design the first fully end-to-end automatic sign language synthesis system trained only on freely available data (daily TV broadcasting); thus, we excluded any manual video annotation. Furthermore, the proposed approach does not rely on any video segmentation. We investigated a feed-forward transformer and a recurrent transformer. To improve the performance of our sequence-to-sequence transformer, soft non-monotonic attention was employed during training. The benefit of character-level features was compared with that of word-level features. Besides a novel approach to sign language synthesis, we also present a gradient-descent-based method for improving skeletal model estimation. This method not only smooths skeletal models and interpolates missing bones, but it also creates 3D skeletal models from 2D models. Our experiments focus on a weather forecasting dataset in Czech Sign Language.
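The gradient-descent-based refinement mentioned above can be illustrated with a minimal sketch. The snippet below optimizes a 3D joint sequence so that its 2D projection matches detected keypoints while a temporal smoothness term fills in missing joints and regularizes depth; the orthographic projection, the specific loss terms and weights, and the use of Adam are all assumptions for illustration, not the paper's actual objective.

```python
# Hypothetical sketch of gradient-descent skeletal refinement.
# Assumed: orthographic projection, confidence-weighted reprojection loss,
# and second-difference smoothness; the paper's exact formulation may differ.
import torch

def refine_skeleton(obs_2d, conf, steps=500, lr=0.05, smooth_w=1.0):
    """obs_2d: (T, J, 2) observed 2D joints; conf: (T, J) confidences,
    with 0 marking missing joints. Returns a smooth (T, J, 3) sequence."""
    T, J, _ = obs_2d.shape
    # Initialize the 3D sequence from the 2D observations with zero depth.
    x3d = torch.cat([obs_2d, torch.zeros(T, J, 1)], dim=-1).clone()
    x3d.requires_grad_(True)
    opt = torch.optim.Adam([x3d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Orthographic projection: simply drop the depth coordinate.
        proj = x3d[..., :2]
        # Confidence-weighted reprojection error; joints with conf == 0
        # are constrained only by the smoothness term (interpolation).
        fit = (conf.unsqueeze(-1) * (proj - obs_2d) ** 2).mean()
        # Penalize second temporal differences to smooth the trajectory.
        accel = x3d[2:] - 2 * x3d[1:-1] + x3d[:-2]
        smooth = (accel ** 2).mean()
        loss = fit + smooth_w * smooth
        loss.backward()
        opt.step()
    return x3d.detach()
```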
Cite
Text
Zelinka and Kanis. "Neural Sign Language Synthesis: Words Are Our Glosses." Winter Conference on Applications of Computer Vision, 2020.
Markdown
[Zelinka and Kanis. "Neural Sign Language Synthesis: Words Are Our Glosses." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/zelinka2020wacv-neural/)
BibTeX
@inproceedings{zelinka2020wacv-neural,
  title = {{Neural Sign Language Synthesis: Words Are Our Glosses}},
  author = {Zelinka, Jan and Kanis, Jakub},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year = {2020},
  url = {https://mlanthology.org/wacv/2020/zelinka2020wacv-neural/}
}