Multilingual Topic Models for Unaligned Text

Abstract

We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.

Cite

Text

Boyd-Graber and Blei. "Multilingual Topic Models for Unaligned Text." Conference on Uncertainty in Artificial Intelligence, 2009.

Markdown

[Boyd-Graber and Blei. "Multilingual Topic Models for Unaligned Text." Conference on Uncertainty in Artificial Intelligence, 2009.](https://mlanthology.org/uai/2009/boydgraber2009uai-multilingual/)

BibTeX

@inproceedings{boydgraber2009uai-multilingual,
  title     = {{Multilingual Topic Models for Unaligned Text}},
  author    = {Boyd-Graber, Jordan L. and Blei, David M.},
  booktitle = {Conference on Uncertainty in Artificial Intelligence},
  year      = {2009},
  pages     = {75--82},
  url       = {https://mlanthology.org/uai/2009/boydgraber2009uai-multilingual/}
}