YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Abstract

Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With more than 3,000 hours of video across more than 25 sign languages, YouTube-SL-25 is (a) more than 3x the size of YouTube-ASL, (b) the largest parallel sign language dataset to date, and (c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.

Cite

Text

Tanzer and Zhang. "YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus." International Conference on Learning Representations, 2025.

Markdown

[Tanzer and Zhang. "YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/tanzer2025iclr-youtubesl25/)

BibTeX

@inproceedings{tanzer2025iclr-youtubesl25,
  title     = {{YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus}},
  author    = {Tanzer, Garrett and Zhang, Biao},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/tanzer2025iclr-youtubesl25/}
}