A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation

Liu, Yongkang; Nie, Ercong; Feng, Shi; Hua, Zheng; Ding, Zifeng; Wang, Daling; Zhang, Yifei; Schütze, Hinrich

doi:10.1007/978-3-031-70344-7_10

ECML-PKDD 2024 pp. 162-177

Abstract

Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD$^2$G. The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD$^2$G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD$^2$G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available on GitHub repository (https://github.com/misonsky/Amdg).
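
The two-stage data flow described in the abstract can be illustrated with a minimal sketch. The abstract does not say how domain-specific features are identified or removed, so everything below is an assumption made for illustration: the hand-written `DOMAIN_TERMS` lexicon, the `[ENT]` placeholder, and the helper names `de_domain` and `build_training_stages` are hypothetical and are not taken from the paper or its repository. English utterances are used for readability, although the paper's experiments are on Chinese corpora.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical per-domain term lists. The abstract does not describe how
# domain-specific features are detected; a fixed lexicon stands in here.
DOMAIN_TERMS: Dict[str, List[str]] = {
    "film": ["director", "movie"],
    "music": ["singer", "album"],
}

PLACEHOLDER = "[ENT]"  # assumed mask token, not from the paper


def de_domain(utterance: str, domain: str) -> str:
    """Replace the domain's listed terms with a placeholder, keeping the sentence frame."""
    terms = DOMAIN_TERMS.get(domain, [])
    if not terms:
        return utterance
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, PLACEHOLDER, utterance, flags=re.IGNORECASE)


def build_training_stages(
    corpora: Dict[str, List[str]], target_domain: str
) -> Tuple[List[str], List[str]]:
    """Return (stage 1 data, stage 2 data).

    Stage 1: de-domained utterances pooled from every domain
             (input to domain-agnostic training).
    Stage 2: the unmodified target-domain utterances
             (input to domain adaptation training).
    """
    stage1 = [de_domain(u, d) for d, utts in corpora.items() for u in utts]
    stage2 = list(corpora[target_domain])
    return stage1, stage2


corpora = {
    "film": ["Who is the director of this movie?"],
    "music": ["Who is the singer of this album?"],
}
stage1, stage2 = build_training_stages(corpora, target_domain="music")
```

In this toy example both utterances reduce to the same frame once their domain terms are masked, which is the kind of shared expression pattern that domain-agnostic training is meant to capture. The original target-domain text is then reserved for the adaptation stage.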

Cite

Text

Liu et al. "A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70344-7_10

Markdown

[Liu et al. "A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/liu2024ecmlpkdd-unified/) doi:10.1007/978-3-031-70344-7_10

BibTeX

@inproceedings{liu2024ecmlpkdd-unified,
  title     = {{A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation}},
  author    = {Liu, Yongkang and Nie, Ercong and Feng, Shi and Hua, Zheng and Ding, Zifeng and Wang, Daling and Zhang, Yifei and Schütze, Hinrich},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2024},
  pages     = {162-177},
  doi       = {10.1007/978-3-031-70344-7_10},
  url       = {https://mlanthology.org/ecmlpkdd/2024/liu2024ecmlpkdd-unified/}
}