Neural Networks Incorporating Unlabeled and Partially-Labeled Data for Cross-Domain Chinese Word Segmentation

Abstract

Most existing Chinese word segmentation (CWS) methods are supervised and therefore require large-scale annotated domain-specific datasets for training. In this paper, we address CWS for resource-poor domains that lack annotated data. We propose a novel neural network model that incorporates unlabeled and partially-labeled data. To make use of unlabeled data, we combine a bidirectional LSTM segmentation model with two character-level language models using a gate mechanism. These language models can capture co-occurrence information. To make use of partially-labeled data, we modify the original cross-entropy loss function of the RNN. Experimental results demonstrate that the method performs well on CWS tasks across a series of domains.
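
The abstract does not spell out the gate or the modified loss, so the following is only a minimal sketch of one common way such components are built, not the authors' formulation. It assumes a BMES tag scheme, an element-wise sigmoid gate that mixes a segmentation hidden vector with a language-model hidden vector, and a partial-label loss that sums probability mass over the tags each character is still allowed to take. The names `TAGS`, `gated_mix`, and `partial_label_loss` are hypothetical and chosen for illustration.

```python
import math

# Assumed tag scheme (Begin, Middle, End, Single); the abstract does not name one.
TAGS = ["B", "M", "E", "S"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mix(h_seg, h_lm, gate_logits):
    """Element-wise gate g = sigmoid(z): returns g * h_seg + (1 - g) * h_lm.

    Assumption: in a real model the gate logits z would be produced by a
    learned layer; here they are passed in directly.
    """
    mixed = []
    for a, b, z in zip(h_seg, h_lm, gate_logits):
        g = sigmoid(z)
        mixed.append(g * a + (1.0 - g) * b)
    return mixed


def partial_label_loss(logits, allowed):
    """Sum over characters of -log(probability mass on the allowed tags).

    logits:  one list of len(TAGS) scores per character.
    allowed: one set of tag indices per character. A singleton set is a
             fully labeled character (ordinary cross entropy); the full
             set is an unlabeled character (contributes zero loss).
    """
    loss = 0.0
    for scores, ok in zip(logits, allowed):
        if not ok:
            raise ValueError("each character needs at least one allowed tag")
        probs = softmax(scores)
        mass = sum(probs[i] for i in ok)
        loss -= math.log(mass)
    return loss
```

With singleton sets for every character, `partial_label_loss` reduces to the standard cross-entropy loss; widening a character's set to all tags removes that character's contribution, which is how partially-labeled sentences can be trained on without guessing the missing tags.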

Cite

Text

Zhao et al. "Neural Networks Incorporating Unlabeled and Partially-Labeled Data for Cross-Domain Chinese Word Segmentation." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/640

Markdown

[Zhao et al. "Neural Networks Incorporating Unlabeled and Partially-Labeled Data for Cross-Domain Chinese Word Segmentation." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/zhao2018ijcai-neural/) doi:10.24963/IJCAI.2018/640

BibTeX

@inproceedings{zhao2018ijcai-neural,
  title     = {{Neural Networks Incorporating Unlabeled and Partially-Labeled Data for Cross-Domain Chinese Word Segmentation}},
  author    = {Zhao, Lujun and Zhang, Qi and Wang, Peng and Liu, Xiaoyu},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {4602--4608},
  doi       = {10.24963/IJCAI.2018/640},
  url       = {https://mlanthology.org/ijcai/2018/zhao2018ijcai-neural/}
}