Fast Training Dataset Attribution via In-Context Learning

Abstract

We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without the provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, yielding more reliable estimates of data contributions.
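To make the matrix-factorization framing concrete, here is a minimal sketch (not the paper's implementation; all names are hypothetical): if each data source induces a distribution over output tokens, the observed output distribution can be modeled as a convex combination of the source distributions, and the contribution scores are the mixture weights recovered by constrained least squares.

```python
# Hypothetical sketch of the mixture-distribution framing: contribution
# scores as nonnegative mixture weights recovered by least squares.
import numpy as np
from scipy.optimize import nnls


def mixture_contribution_scores(source_dists, output_dist):
    """Estimate weights w >= 0, sum(w) = 1, minimizing
    || w @ source_dists - output_dist ||_2."""
    # nnls solves min ||A x - b||_2 subject to x >= 0;
    # each column of A is one source's token distribution.
    w, _ = nnls(source_dists.T, output_dist)
    return w / w.sum()  # project onto the probability simplex


# Toy example: 3 sources over a 5-token vocabulary.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=3)   # (3 sources, 5 tokens)
w_true = np.array([0.6, 0.3, 0.1])
q = w_true @ P                          # observed output distribution
w_hat = mixture_contribution_scores(P, q)
```

In this noiseless toy setting the factorization recovers the planted weights exactly; the paper's empirical claim is that this style of estimator degrades more gracefully than similarity-based scoring when the retrieved context is noisy.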

Cite

Text

Fotouhi et al. "Fast Training Dataset Attribution via In-Context Learning." ICML 2024 Workshops: ICL, 2024.

Markdown

[Fotouhi et al. "Fast Training Dataset Attribution via In-Context Learning." ICML 2024 Workshops: ICL, 2024.](https://mlanthology.org/icmlw/2024/fotouhi2024icmlw-fast/)

BibTeX

@inproceedings{fotouhi2024icmlw-fast,
  title     = {{Fast Training Dataset Attribution via In-Context Learning}},
  author    = {Fotouhi, Milad and Bahadori, Mohammad Taha and Feyisetan, Seyi and Arabshahi, Payman and Heckerman, David},
  booktitle = {ICML 2024 Workshops: ICL},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/fotouhi2024icmlw-fast/}
}