Siamese BERT-Based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset

Abstract

Web search engines focus on serving highly relevant results within hundreds of milliseconds. Pre-trained transformer language models such as BERT are therefore hard to use in this scenario due to their high computational demands. We present a real-time approach to the document ranking problem leveraging a BERT-based siamese architecture. The model is already deployed in a commercial search engine, where it improves production performance by more than 3%. For further research and evaluation, we release DaReCzech, a unique dataset of 1.6 million Czech user query-document pairs with manually assigned relevance levels. We also release Small-E-Czech, an Electra-small language model pre-trained on a large Czech corpus. We believe these resources will support the endeavours of both the search-relevance and multilingual-focused research communities.
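The siamese (bi-encoder) setup is what makes real-time serving feasible: document embeddings can be precomputed offline, only the query is encoded at request time, and ranking reduces to a vector-similarity lookup. A minimal sketch of that scoring step, using toy NumPy vectors in place of pooled BERT embeddings (all names and values here are illustrative, not the authors' implementation):

```python
import numpy as np

def rank_documents(query_emb, doc_embs):
    """Score each precomputed document embedding against the query
    embedding by dot product and return documents in descending score order."""
    scores = doc_embs @ query_emb          # one similarity score per document
    order = np.argsort(-scores)            # indices of documents, best first
    return order, scores

# Hypothetical 4-dimensional embeddings standing in for encoder outputs.
doc_embs = np.array([
    [0.1, 0.9, 0.0, 0.0],   # doc 0
    [0.8, 0.1, 0.1, 0.0],   # doc 1
    [0.0, 0.2, 0.9, 0.1],   # doc 2
])
query_emb = np.array([0.9, 0.0, 0.1, 0.0])

order, scores = rank_documents(query_emb, doc_embs)
print(order[0])  # doc 1 aligns best with this query
```

In production, the expensive encoder forward pass for documents happens at indexing time; the per-query cost is a single query encoding plus cheap dot products, which is what keeps latency within the hundreds-of-milliseconds budget described in the abstract.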

Cite

Text

Kocián et al. "Siamese BERT-Based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I11.21502

Markdown

[Kocián et al. "Siamese BERT-Based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/kocian2022aaai-siamese/) doi:10.1609/AAAI.V36I11.21502

BibTeX

@inproceedings{kocian2022aaai-siamese,
  title     = {{Siamese BERT-Based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset}},
  author    = {Kocián, Matej and Náplava, Jakub and Stancl, Daniel and Kadlec, Vladimír},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {12369--12377},
  doi       = {10.1609/AAAI.V36I11.21502},
  url       = {https://mlanthology.org/aaai/2022/kocian2022aaai-siamese/}
}