Pseudo-Aligned Multilingual Corpora

Abstract

In machine translation, document alignment refers to finding correspondences between documents that are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semisupervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that documents that are close in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average, when 10% of the corpus consists of exact correspondences, an on-topic correspondence occurs within the top 5 foreign neighbors in the lower-dimensional space, while the exact correspondence occurs within the top 10 foreign neighbors in this space. We also show how to substantially improve these results with a novel method for incorporating language-independent information.
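The abstract does not spell out how the projection is computed, so the following is only a loose illustration of the idea of a shared space anchored on known translation pairs. It deliberately swaps in a much simpler technique than the graph-based semisupervised method the abstract describes: each document is represented by its cosine similarity to the aligned (anchor) documents of its own language, which makes coordinates comparable across languages. The function names, the toy term-count vectors, and the choice of Euclidean distance are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(docs, anchor_ids):
    """Map every document to its similarities with its own language's anchors.

    Anchor i in one language is an exact translation of anchor i in the
    other, so coordinate i means the same thing in both projections.
    """
    return [[cosine(d, docs[a]) for a in anchor_ids] for d in docs]


def foreign_neighbors(query, foreign_points, top_k):
    """Indices of the top_k foreign documents nearest to `query`."""
    order = sorted(
        range(len(foreign_points)),
        key=lambda j: math.dist(query, foreign_points[j]),
    )
    return order[:top_k]


# Made-up toy corpora: rows are documents, columns are term counts.
# The two languages use different vocabularies (4 terms vs. 3 terms).
en_docs = [[3, 2, 0, 0], [0, 0, 2, 3], [2, 3, 0, 1], [0, 1, 3, 2]]
de_docs = [[3, 0, 0], [0, 2, 2], [0, 3, 1], [2, 1, 0]]

# Assumption: documents 0 and 1 are the known exact translation pairs.
anchors = [0, 1]
en_points = project(en_docs, anchors)
de_points = project(de_docs, anchors)

# Foreign neighbors of the unaligned English document 2 in the shared space.
print(foreign_neighbors(en_points[2], de_points, top_k=2))
```

In this sketch the dimensionality of the shared space equals the number of anchor pairs, so it shrinks or grows with the amount of supervision. A graph-based method such as the one the abstract describes would additionally exploit similarities among the unaligned documents themselves.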

Cite

Text

Diaz and Metzler. "Pseudo-Aligned Multilingual Corpora." International Joint Conference on Artificial Intelligence, 2007.

Markdown

[Diaz and Metzler. "Pseudo-Aligned Multilingual Corpora." International Joint Conference on Artificial Intelligence, 2007.](https://mlanthology.org/ijcai/2007/diaz2007ijcai-pseudo/)

BibTeX

@inproceedings{diaz2007ijcai-pseudo,
  title     = {{Pseudo-Aligned Multilingual Corpora}},
  author    = {Diaz, Fernando and Metzler, Donald},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2007},
  pages     = {2727-2732},
  url       = {https://mlanthology.org/ijcai/2007/diaz2007ijcai-pseudo/}
}