SQuAD-SRC: A Dataset for Multi-Accent Spoken Reading Comprehension

Abstract

Spoken Reading Comprehension (SRC) is a challenging problem in spoken natural language retrieval, which automatically extracts the answer from text-form contents according to an audio-form question. However, existing spoken question answering approaches are mainly based on synthetically generated audio data, which may not transfer effectively to multi-accent spoken question answering in many real-world applications. In this paper, we construct a large-scale multi-accent human spoken dataset, SQuAD-SRC, in order to study the problem of multi-accent spoken reading comprehension. We recruit 24 native English speakers from six different countries with various English accents and have them record audio-form questions for the corresponding text-form contents. The dataset consists of 98,169 spoken question answering pairs and 20,963 passages from the popular machine reading comprehension dataset SQuAD. We present a statistical analysis of our SQuAD-SRC dataset and conduct extensive experiments on it, comparing cascaded SRC approaches with enhanced end-to-end ones. Moreover, we explore various adaptation strategies to improve SRC performance, especially for multi-accent spoken questions.
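The cascaded SRC baselines referred to in the abstract first transcribe the spoken question with an ASR model and then run a text-based extractive reader over the written passage. The sketch below is a minimal, hedged illustration of such a pipeline using off-the-shelf Hugging Face models; the specific models (openai/whisper-small, deepset/roberta-base-squad2) and the audio file path are illustrative assumptions, not the systems evaluated in the paper.

```python
# Minimal sketch of a cascaded SRC pipeline: ASR on the spoken question,
# then extractive QA over the text passage. Model choices are assumptions
# for illustration only, not the paper's baselines.
from transformers import pipeline

# Step 1: transcribe the audio-form question (hypothetical file path).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
question_text = asr("spoken_question.wav")["text"]

# Step 2: extract the answer span from the text-form passage.
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
passage = "The Normans were the people who in the 10th and 11th centuries gave their name to Normandy."
result = reader(question=question_text, context=passage)

print(result["answer"], result["score"])
```

ASR errors on accented speech propagate into the reader in this setup, which is the failure mode that motivates both the multi-accent dataset and the end-to-end alternatives discussed in the paper.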

Cite

Text

Tang and Tung. "SQuAD-SRC: A Dataset for Multi-Accent Spoken Reading Comprehension." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/578

Markdown

[Tang and Tung. "SQuAD-SRC: A Dataset for Multi-Accent Spoken Reading Comprehension." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/tang2023ijcai-squad/) doi:10.24963/IJCAI.2023/578

BibTeX

@inproceedings{tang2023ijcai-squad,
  title     = {{SQuAD-SRC: A Dataset for Multi-Accent Spoken Reading Comprehension}},
  author    = {Tang, Yixuan and Tung, Anthony K. H.},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {5206--5214},
  doi       = {10.24963/IJCAI.2023/578},
  url       = {https://mlanthology.org/ijcai/2023/tang2023ijcai-squad/}
}