Cw2vec: Learning Chinese Word Embeddings with Stroke N-Gram Information
Abstract
We propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observation that exploiting stroke-level information is crucial for improving the learning of Chinese word embeddings. Specifically, we design a minimalist approach to exploit such features, by using stroke n-grams, which capture semantic and morphological level information of Chinese words. Through qualitative analysis, we demonstrate that our model is able to extract semantic information that cannot be captured by existing methods. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.
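To make the idea of a stroke n-gram concrete, here is a minimal Python sketch that turns a word into a sequence of stroke-type IDs and slides windows of several lengths over it. The abstract does not specify these details, so everything here is an assumption for illustration: the tiny `STROKES` table, its ID scheme, the function names, and the n-gram length range are all hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy table mapping a character to its stroke-type IDs.
# The ID scheme (1 = horizontal, 3 = left-falling, 4 = right-falling)
# is an assumption made for this illustration only.
STROKES: Dict[str, str] = {
    "大": "134",
    "人": "34",
}


def stroke_sequence(word: str, table: Dict[str, str] = STROKES) -> str:
    """Concatenate the stroke IDs of every character in the word."""
    return "".join(table[ch] for ch in word)


def stroke_ngrams(seq: str, n_min: int = 3, n_max: int = 5) -> List[str]:
    """Collect all contiguous stroke n-grams with n_min <= n <= n_max."""
    grams: List[str] = []
    for n in range(n_min, n_max + 1):
        for start in range(len(seq) - n + 1):
            grams.append(seq[start:start + n])
    return grams


seq = stroke_sequence("大人")
grams = stroke_ngrams(seq)
print(seq)
print(grams)
```

In a model of this kind, each n-gram would typically receive its own embedding vector, so that words sharing stroke patterns also share parameters. How those vectors are combined and trained is described in the paper itself, not in this sketch.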
Cite
Text
Cao et al. "Cw2vec: Learning Chinese Word Embeddings with Stroke N-Gram Information." AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/AAAI.V32I1.12029
Markdown
[Cao et al. "Cw2vec: Learning Chinese Word Embeddings with Stroke N-Gram Information." AAAI Conference on Artificial Intelligence, 2018.](https://mlanthology.org/aaai/2018/cao2018aaai-cw/) doi:10.1609/AAAI.V32I1.12029
BibTeX
@inproceedings{cao2018aaai-cw,
title = {{Cw2vec: Learning Chinese Word Embeddings with Stroke N-Gram Information}},
author = {Cao, Shaosheng and Lu, Wei and Zhou, Jun and Li, Xiaolong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2018},
pages = {5053-5061},
doi = {10.1609/AAAI.V32I1.12029},
url = {https://mlanthology.org/aaai/2018/cao2018aaai-cw/}
}