Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition

Abstract

One of the central tasks of medical text analysis is to extract and structure meaningful information from plain-text clinical documents. Named Entity Recognition (NER) is a sub-task of information extraction that involves identifying predefined entities in unstructured free text. NER models require large amounts of human-labeled training data, but human annotation is costly, laborious, and often requires medical training. Here, we aim to overcome the shortage of manually annotated data by introducing a training scheme for NER models that uses an existing medical ontology to assign weak labels to entities and improves domain-specific model adaptation through in-domain continual pretraining. Because human annotation resources are limited, we develop a module that collects a more representative test dataset from the data lake than random selection would. To validate our framework, we invite clinicians to annotate the test set. In this way, we construct two Finnish medical NER datasets based on clinical records retrieved from a hospital’s data lake and evaluate the effectiveness of the proposed methods. The code is available at https://github.com/VRCMF/HAM-net.git.
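As a rough illustration of what assigning weak labels from an ontology can look like, the sketch below tags tokens with BIO labels by greedy longest-match lookup against a term dictionary. It is not the authors' implementation: the function name `weak_label`, the toy English ontology, the entity types `DISEASE` and `DRUG`, and whitespace tokenization are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy ontology: tokenized, lowercased term -> entity type.
# A real medical ontology would be far larger and, for this paper, Finnish.
TOY_ONTOLOGY: Dict[Tuple[str, ...], str] = {
    ("diabetes", "mellitus"): "DISEASE",
    ("diabetes",): "DISEASE",
    ("metformin",): "DRUG",
}


def weak_label(tokens: List[str], ontology: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup against ontology terms."""
    max_len = max((len(term) for term in ontology), default=0)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = ontology.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Assumption: simple whitespace tokenization, for illustration only.
tokens = "Patient has diabetes mellitus and takes metformin".split()
tags = weak_label(tokens, TOY_ONTOLOGY)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy, since exact dictionary matching misses inflected forms, abbreviations, and misspellings, which is one reason a clinician-annotated test set is still needed for evaluation.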

Cite

Text

Sun et al. "Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023. doi:10.1007/978-3-031-43427-3_27

Markdown

[Sun et al. "Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023.](https://mlanthology.org/ecmlpkdd/2023/sun2023ecmlpkdd-weak/) doi:10.1007/978-3-031-43427-3_27

BibTeX

@inproceedings{sun2023ecmlpkdd-weak,
  title     = {{Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition}},
  author    = {Sun, Wei and Ji, Shaoxiong and Denti, Tuulia and Moen, Hans and Kerro, Oleg and Rannikko, Antti and Marttinen, Pekka and Koskinen, Miika},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2023},
  pages     = {444-459},
  doi       = {10.1007/978-3-031-43427-3_27},
  url       = {https://mlanthology.org/ecmlpkdd/2023/sun2023ecmlpkdd-weak/}
}