Boosting Monolingual Sentence Representation with Large-Scale Parallel Translation Datasets

Abstract

Although contrastive learning greatly improves sentence representation, its performance is still limited by the size of existing monolingual datasets. Can massive parallel translation pairs, whose sentences are highly correlated semantically, be used to pre-train monolingual models? This paper explores that question. We leverage parallel translated sentence pairs to learn monolingual sentence embeddings and demonstrate superior performance in balancing alignment and uniformity. We achieve a new state-of-the-art average score on the standard Semantic Textual Similarity (STS) benchmarks, outperforming both SimCSE and Sentence-T5.
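The abstract describes using parallel translation pairs as positives for contrastive learning of sentence embeddings. The sketch below is a minimal, hypothetical illustration of that kind of cross-lingual InfoNCE objective, not the authors' released code: the encoder, the 0.05 temperature, and the use of in-batch negatives are assumptions for illustration only.

```python
# Minimal sketch (assumptions, not the paper's implementation): a translation
# pair (sentence, its translation) is treated as a positive pair, and the
# other translations in the batch serve as in-batch negatives.
import torch
import torch.nn.functional as F


def translation_contrastive_loss(src_emb: torch.Tensor,
                                 tgt_emb: torch.Tensor,
                                 temperature: float = 0.05) -> torch.Tensor:
    """src_emb, tgt_emb: (batch, dim) embeddings of sentences and their translations."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    # Cosine similarity between every source and every target in the batch.
    logits = src @ tgt.t() / temperature
    # The i-th source sentence should match the i-th translation.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Random vectors stand in for encoder outputs (e.g., a pooled BERT vector).
    src = torch.randn(8, 768)
    tgt = torch.randn(8, 768)
    print(translation_contrastive_loss(src, tgt).item())
```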

Cite

Text

Wang et al. "Boosting Monolingual Sentence Representation with Large-Scale Parallel Translation Datasets." ICML 2022 Workshops: Pre-Training, 2022.

Markdown

[Wang et al. "Boosting Monolingual Sentence Representation with Large-Scale Parallel Translation Datasets." ICML 2022 Workshops: Pre-Training, 2022.](https://mlanthology.org/icmlw/2022/wang2022icmlw-boosting/)

BibTeX

@inproceedings{wang2022icmlw-boosting,
  title     = {{Boosting Monolingual Sentence Representation with Large-Scale Parallel Translation Datasets}},
  author    = {Wang, Jue and Wang, Haofan and Wu, Xing and Gao, Chaochen and Zhang, Debing},
  booktitle = {ICML 2022 Workshops: Pre-Training},
  year      = {2022},
  url       = {https://mlanthology.org/icmlw/2022/wang2022icmlw-boosting/}
}