Bilingual Distributed Word Representations from Document-Aligned Comparable Data

JAIR 2016 pp. 953-994

doi:10.1613/JAIR.4986

Abstract

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article shows that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We compare our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.

Cite

Text

Vulic and Moens. "Bilingual Distributed Word Representations from Document-Aligned Comparable Data." Journal of Artificial Intelligence Research, 2016. doi:10.1613/JAIR.4986

Markdown

[Vulic and Moens. "Bilingual Distributed Word Representations from Document-Aligned Comparable Data." Journal of Artificial Intelligence Research, 2016.](https://mlanthology.org/jair/2016/vulic2016jair-bilingual/) doi:10.1613/JAIR.4986

BibTeX

@article{vulic2016jair-bilingual,
  title     = {{Bilingual Distributed Word Representations from Document-Aligned Comparable Data}},
  author    = {Vulic, Ivan and Moens, Marie-Francine},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2016},
  pages     = {953--994},
  doi       = {10.1613/JAIR.4986},
  volume    = {55},
  url       = {https://mlanthology.org/jair/2016/vulic2016jair-bilingual/}
}