Bilingual Distributed Word Representations from Document-Aligned Comparable Data

Abstract

We propose a new model for learning bilingual word representations from nonparallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources nor syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and acquire the best reported results on both tasks for all three tested language pairs.

Cite

Text

Vulic and Moens. "Bilingual Distributed Word Representations from Document-Aligned Comparable Data." Journal of Artificial Intelligence Research, 2016. doi:10.1613/JAIR.4986

Markdown

[Vulic and Moens. "Bilingual Distributed Word Representations from Document-Aligned Comparable Data." Journal of Artificial Intelligence Research, 2016.](https://mlanthology.org/jair/2016/vulic2016jair-bilingual/) doi:10.1613/JAIR.4986

BibTeX

@article{vulic2016jair-bilingual,
  title     = {{Bilingual Distributed Word Representations from Document-Aligned Comparable Data}},
  author    = {Vulic, Ivan and Moens, Marie-Francine},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2016},
  pages     = {953-994},
  doi       = {10.1613/JAIR.4986},
  volume    = {55},
  url       = {https://mlanthology.org/jair/2016/vulic2016jair-bilingual/}
}