LLMs Are Good Sign Language Translators

Abstract

Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize sign videos to embody the linguistic characteristics of spoken language and propose a novel SignLLM framework that transforms sign videos into a language-like representation more readable by off-the-shelf LLMs. SignLLM comprises two key modules: (1) the Vector-Quantized Visual Sign module, which converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module, which converts these character-level tokens into word-level sign representations via an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
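
To make the two modules concrete, the sketch below illustrates the general techniques the abstract names: nearest-neighbor vector quantization with a straight-through estimator for the character-level tokens, and an entropy-regularized (Sinkhorn-style) optimal-transport assignment for grouping them into word-level representations. This is a minimal illustration of those standard techniques, not the authors' implementation; all tensor sizes, function names, and hyperparameters are assumptions.

# Minimal sketch (not the paper's code) of the two ideas the abstract names:
# (1) vector-quantizing continuous sign-video features into discrete
# character-level tokens via nearest-codebook lookup, and (2) softly grouping
# those tokens into word-level representations with a Sinkhorn-style
# optimal-transport assignment. All sizes and hyperparameters are illustrative.
import torch

def quantize(features, codebook):
    """Map each frame feature to its nearest codebook entry.

    features: (T, D) continuous sign-video features
    codebook: (K, D) character-level codebook
    Returns discrete token ids (T,) and quantized vectors (T, D).
    """
    dists = torch.cdist(features, codebook)          # pairwise distances (T, K)
    ids = dists.argmin(dim=1)                        # nearest code per frame (T,)
    quantized = codebook[ids]                        # (T, D)
    # Straight-through estimator: gradients flow to `features` during training.
    quantized = features + (quantized - features).detach()
    return ids, quantized

def sinkhorn_assignment(char_tokens, word_codebook, eps=0.05, iters=50):
    """Softly assign character-level tokens to word-level codes by solving an
    entropy-regularized optimal-transport problem with Sinkhorn iterations."""
    cost = torch.cdist(char_tokens, word_codebook)   # transport cost (T, W)
    log_P = -cost / eps
    for _ in range(iters):
        log_P = log_P - log_P.logsumexp(dim=0, keepdim=True)  # column-normalize
        log_P = log_P - log_P.logsumexp(dim=1, keepdim=True)  # row-normalize
    P = log_P.exp()                                  # each row sums to 1 (T, W)
    # Word-level representation: transport-weighted mixture of word codes.
    return P @ word_codebook

# Toy usage with made-up sizes.
torch.manual_seed(0)
frames = torch.randn(16, 64)        # 16 video frames, 64-d features
char_cb = torch.randn(512, 64)      # character-level codebook
word_cb = torch.randn(128, 64)      # word-level codebook
ids, q = quantize(frames, char_cb)
words = sinkhorn_assignment(q, word_cb)
print(ids.shape, words.shape)       # torch.Size([16]) torch.Size([16, 64])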

Cite

Text

Gong et al. "LLMs Are Good Sign Language Translators." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01738

Markdown

[Gong et al. "LLMs Are Good Sign Language Translators." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/gong2024cvpr-llms/) doi:10.1109/CVPR52733.2024.01738

BibTeX

@inproceedings{gong2024cvpr-llms,
  title     = {{LLMs Are Good Sign Language Translators}},
  author    = {Gong, Jia and Foo, Lin Geng and He, Yixuan and Rahmani, Hossein and Liu, Jun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18362--18372},
  doi       = {10.1109/CVPR52733.2024.01738},
  url       = {https://mlanthology.org/cvpr/2024/gong2024cvpr-llms/}
}