Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching

Abstract

We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are written, and does not rely on parallel corpora to constrain the spaces. Instead, we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model's objective function; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code-switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, but this common structure also allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target-language data).
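
The idea of artificial code-switching can be illustrated with a minimal sketch. The abstract does not specify the replacement procedure, so everything here is an assumption made for illustration: the function name `code_switch`, the `translations` dictionary layout, and the switching probability `prob` are all hypothetical. The sketch randomly swaps a token for one of its dictionary translations, so that words from several languages end up sharing context windows before an embedding model is trained on the result.

```python
import random


def code_switch(tokens, translations, prob=0.25, rng=None):
    """Return a copy of `tokens` with some words swapped for translations.

    Illustrative assumption, not the authors' exact procedure:
    each token that has an entry in `translations` is replaced,
    with probability `prob`, by a uniformly chosen translation.
    """
    rng = rng or random.Random()
    switched = []
    for tok in tokens:
        options = translations.get(tok)
        if options and rng.random() < prob:
            # Borrow a word from another language; its neighbours in the
            # sentence now serve as context for the borrowed word.
            switched.append(rng.choice(options))
        else:
            switched.append(tok)
    return switched


# Hypothetical toy dictionary of word translations (English/Spanish/French).
translations = {
    "dog": ["perro", "chien"],
    "house": ["casa", "maison"],
}

sentence = "the dog sleeps in the house".split()
mixed = code_switch(sentence, translations, prob=0.5, rng=random.Random(0))
print(" ".join(mixed))
```

The switched sentences could then be passed to any standard embedding trainer; because the language of each document is never consulted, the sketch stays consistent with the language-agnostic setting described above.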

Cite

Text

Wick et al. "Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching." AAAI Conference on Artificial Intelligence, 2016. doi:10.1609/AAAI.V30I1.10360

Markdown

[Wick et al. "Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching." AAAI Conference on Artificial Intelligence, 2016.](https://mlanthology.org/aaai/2016/wick2016aaai-minimally/) doi:10.1609/AAAI.V30I1.10360

BibTeX

@inproceedings{wick2016aaai-minimally,
  title     = {{Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching}},
  author    = {Wick, Michael L. and Kanani, Pallika and Pocock, Adam Craig},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2016},
  pages     = {2849--2855},
  doi       = {10.1609/AAAI.V30I1.10360},
  url       = {https://mlanthology.org/aaai/2016/wick2016aaai-minimally/}
}