Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge
Abstract
Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet LLMs are prone to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In biomedical research, the latest discoveries are key to academic and industrial actors, yet they are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities (e.g., drugs, genes, diseases) with LLMs therefore becomes a challenge of capturing the long-tail knowledge of the biomedical scientific production. In this study, we show that typical retrieval-augmented generation (RAG) methods may leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel method that leverages a knowledge graph to down-sample these clusters and mitigate the information overload problem. Its retrieval performance is about twice as good as that of embedding similarity alternatives on both precision and recall. Finally, we demonstrate that embedding similarity and knowledge graph retrieval methods can be combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
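The abstract does not spell out how the down-sampling is done, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each retrieved candidate already carries a similarity score and a cluster label (for example, the over-represented concept it mentions). The field names `id`, `cluster` and `score`, the function names, and the `per_cluster_cap` parameter are all hypothetical.

```python
from collections import defaultdict


def top_k_by_score(candidates, k):
    """Plain similarity ranking: keep the k highest-scoring candidates."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]


def top_k_downsampled(candidates, k, per_cluster_cap=1):
    """Walk candidates in score order, keeping at most `per_cluster_cap`
    items from any one cluster, until k items are collected."""
    kept = []
    taken = defaultdict(int)  # cluster label -> number of items kept so far
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if taken[c["cluster"]] < per_cluster_cap:
            kept.append(c)
            taken[c["cluster"]] += 1
        if len(kept) == k:
            break
    return kept


# Toy data (invented for illustration): one dominant cluster, two rare ones.
candidates = [
    {"id": "d1", "cluster": "geneA", "score": 0.95},
    {"id": "d2", "cluster": "geneA", "score": 0.94},
    {"id": "d3", "cluster": "geneA", "score": 0.93},
    {"id": "d4", "cluster": "geneB", "score": 0.80},
    {"id": "d5", "cluster": "geneC", "score": 0.70},
]

plain = [c["id"] for c in top_k_by_score(candidates, 3)]
balanced = [c["id"] for c in top_k_downsampled(candidates, 3)]
print(plain)
print(balanced)
```

In this toy setup, plain top-k returns only documents from the dominant cluster, while the capped variant also surfaces the two rarer clusters. This is the kind of long-tail coverage the abstract describes, under the stated assumptions.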
Cite
Text
Delile et al. "Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge." ICML 2024 Workshops: ML4LMS, 2024.
Markdown
[Delile et al. "Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge." ICML 2024 Workshops: ML4LMS, 2024.](https://mlanthology.org/icmlw/2024/delile2024icmlw-graphbased/)
BibTeX
@inproceedings{delile2024icmlw-graphbased,
title = {{Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge}},
author = {Delile, Julien and Mukherjee, Srayanta and Van Pamel, Anton and Zhukov, Leonid},
booktitle = {ICML 2024 Workshops: ML4LMS},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/delile2024icmlw-graphbased/}
}