Speech-to-Visualization: Toward End-to-End Speech-Driven Data Visualization Generation from Natural Language Questions
Abstract
Data visualization (DV) has evolved rapidly, transforming intricate datasets into accessible visual representations. However, the intricate grammar of DV languages, such as Vega-Lite, presents a substantial barrier for beginners and users without technical backgrounds. To address this challenge, extensive research has focused on developing models that can translate natural language questions (NLQs) into DV languages, a task known as text-to-visualization (text-to-vis). With the recent development of speech-related technologies, particularly Automatic Speech Recognition (ASR), voice-based interaction has become a growing trend in real-world applications. In this paper, we introduce speech-to-vis, a novel task that translates speech-form NLQs into data visualizations. To address the scarcity of relevant datasets, we present SpeechNVBench, the first manually annotated dataset specifically designed for this task. Our research reveals that the intuitive cascaded approach (i.e., ASR followed by text-to-vis) suffers from error propagation, where small errors in earlier stages lead to larger errors in subsequent stages. In response, we introduce SpeechVisNet, the first end-to-end neural architecture that directly translates speech-form NLQs into DVs. SpeechVisNet incorporates advanced structures, such as a DV-aware decoder, to ensure reliable output. Furthermore, to mitigate the modality gap between speech-modality questions and the text-modality data schema, we explore bridging techniques to align them. Experiments on our proposed dataset demonstrate SpeechVisNet's competitive edge over various strong baselines. This work aims to drive innovation in human-machine interfaces, enhancing the efficiency and accessibility of DV tools across various domains.
Cite
Text
Zhang et al. "Speech-to-Visualization: Toward End-to-End Speech-Driven Data Visualization Generation from Natural Language Questions." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025. doi:10.1007/978-3-032-06109-6_25
Markdown
[Zhang et al. "Speech-to-Visualization: Toward End-to-End Speech-Driven Data Visualization Generation from Natural Language Questions." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025.](https://mlanthology.org/ecmlpkdd/2025/zhang2025ecmlpkdd-speechtovisualization/) doi:10.1007/978-3-032-06109-6_25
BibTeX
@inproceedings{zhang2025ecmlpkdd-speechtovisualization,
title = {{Speech-to-Visualization: Toward End-to-End Speech-Driven Data Visualization Generation from Natural Language Questions}},
author = {Zhang, Haodi and Zhang, Xinhe and Zhou, Jihua and Wu, Kaishun and Song, Yuanfeng and Wong, Raymond Chi-Wing},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2025},
pages = {437--453},
doi = {10.1007/978-3-032-06109-6_25},
url = {https://mlanthology.org/ecmlpkdd/2025/zhang2025ecmlpkdd-speechtovisualization/}
}