ConTextual Masked Auto-Encoder for Dense Passage Retrieval

Abstract

Dense passage retrieval aims to retrieve passages relevant to a query from a large corpus, based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, while context-supervised masked auto-encoding learns to model the semantic correlation between text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the effectiveness of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
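
For a concrete picture of the pre-training described above, below is a minimal PyTorch sketch of an asymmetric masked auto-encoder with the two objectives. It is an illustration, not the authors' implementation: the class name CoTMAESketch, the layer counts, and the exact wiring (prepending span A's dense [CLS] vector to a shallow decoder that reconstructs masked tokens of the neighboring span B) are assumptions made for exposition; the official code at the repository above is authoritative.

import torch
import torch.nn as nn

class CoTMAESketch(nn.Module):
    # Hypothetical minimal sketch of CoT-MAE-style pre-training; module sizes
    # and names are illustrative assumptions, not the paper's configuration.
    def __init__(self, vocab_size=30522, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Deep encoder: encodes a masked text span; the output at position 0
        # serves as the dense [CLS] vector.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True),
            num_layers=12)
        # Shallow (one-layer) decoder: the asymmetry pressures the dense
        # vector to carry the semantics needed for reconstruction.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True),
            num_layers=1)
        self.lm_head = nn.Linear(dim, vocab_size)
        # Unmasked positions carry label -100 and are ignored by the loss.
        self.loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, span_a, span_b, labels_a, labels_b):
        # Self-supervised MAE: reconstruct span A's masked tokens from A itself.
        enc_a = self.encoder(self.embed(span_a))
        self_loss = self.loss(
            self.lm_head(enc_a).flatten(0, 1), labels_a.flatten())
        # Context-supervised MAE: prepend span A's dense vector so span B's
        # masked tokens are reconstructed with help from A's compressed
        # semantics.
        dec_in = torch.cat([enc_a[:, :1], self.embed(span_b)], dim=1)
        dec_out = self.decoder(dec_in)[:, 1:]  # drop the prepended vector
        ctx_loss = self.loss(
            self.lm_head(dec_out).flatten(0, 1), labels_b.flatten())
        return self_loss + ctx_loss

Under these assumptions, training reduces to minimizing the two reconstruction losses over sampled pairs of neighboring spans. Because the shallow decoder sees span B only through its own embeddings plus A's single vector, the encoder is pushed to pack span-level semantics into that vector, which is exactly the dense representation used for retrieval at inference time.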

Cite

Text

Wu et al. "ConTextual Masked Auto-Encoder for Dense Passage Retrieval." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I4.25598

Markdown

[Wu et al. "ConTextual Masked Auto-Encoder for Dense Passage Retrieval." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/wu2023aaai-contextual/) doi:10.1609/AAAI.V37I4.25598

BibTeX

@inproceedings{wu2023aaai-contextual,
  title     = {{ConTextual Masked Auto-Encoder for Dense Passage Retrieval}},
  author    = {Wu, Xing and Ma, Guangyuan and Lin, Meng and Lin, Zijia and Wang, Zhongyuan and Hu, Songlin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {4738--4746},
  doi       = {10.1609/AAAI.V37I4.25598},
  url       = {https://mlanthology.org/aaai/2023/wu2023aaai-contextual/}
}