On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Abstract

Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., $99.99\%$ of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only $2.4\%$ average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to $32.5\%$ on QQP and $20.1\%$ on WikiQA.
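To make the evaluation metric concrete, the following is a minimal sketch of average precision using its standard ranking definition: the mean of precision at each rank where a positive appears. It is an illustration only and not the authors' evaluation code. The function name `average_precision` and all scores and labels are hypothetical toy values, chosen to show how additional high-scoring negatives in a more imbalanced test set pull the metric down even when the positives keep the same scores.

```python
def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears."""
    # Rank examples from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Toy balanced test set (hypothetical values): 2 positives, 2 negatives.
balanced_scores = [0.9, 0.8, 0.7, 0.6]
balanced_labels = [1, 0, 1, 0]
balanced_ap = average_precision(balanced_scores, balanced_labels)

# Same examples plus 8 extra negatives that the model scores highly (0.85),
# mimicking a more realistic, imbalanced test distribution.
imbalanced_scores = balanced_scores + [0.85] * 8
imbalanced_labels = balanced_labels + [0] * 8
imbalanced_ap = average_precision(imbalanced_scores, imbalanced_labels)

print(f"balanced AP:   {balanced_ap:.3f}")
print(f"imbalanced AP: {imbalanced_ap:.3f}")
```

In this toy setup the second positive drops from rank 3 to rank 11 once the extra negatives are added, so average precision falls although the model and the positives are unchanged. This is the kind of effect that an evaluation on label-balanced data would not reveal.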

Cite

Text

Mussmann et al. "On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks." NeurIPS 2020 Workshops: HAMLETS, 2020.

Markdown

[Mussmann et al. "On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks." NeurIPS 2020 Workshops: HAMLETS, 2020.](https://mlanthology.org/neuripsw/2020/mussmann2020neuripsw-importance/)

BibTeX

@inproceedings{mussmann2020neuripsw-importance,
  title     = {{On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks}},
  author    = {Mussmann, Stephen and Jia, Robin and Liang, Percy},
  booktitle = {NeurIPS 2020 Workshops: HAMLETS},
  year      = {2020},
  url       = {https://mlanthology.org/neuripsw/2020/mussmann2020neuripsw-importance/}
}