Fast Training Dataset Attribution via In-Context Learning
Abstract
We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
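As a rough illustration of the first idea (comparing outputs with and without provided context), the following minimal sketch scores how far an output moves when context from each candidate dataset is supplied. It is not the authors' implementation: the function names, the dataset labels, the example strings, and the use of token-level Jaccard overlap as the similarity measure are all assumptions made for this example, and the abstract does not specify how such a difference maps to a contribution score.

```python
# Hypothetical sketch: all names and the similarity measure are assumptions,
# not taken from the paper.

def token_set(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def jaccard_similarity(a, b):
    """Token-level Jaccard overlap, used here as a stand-in similarity."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0  # two empty outputs are treated as identical
    return len(sa & sb) / len(sa | sb)


def context_shift_scores(output_without_context, outputs_with_context):
    """Map each dataset label to 1 - similarity between the output generated
    without context and the output generated with context from that dataset."""
    return {
        name: 1.0 - jaccard_similarity(output_without_context, out)
        for name, out in outputs_with_context.items()
    }


# Placeholder strings standing in for real LLM outputs.
baseline = "the capital of france is paris"
with_context = {
    "dataset_a": "the capital of france is paris",
    "dataset_b": "paris has been the french capital since 987",
}
scores = context_shift_scores(baseline, with_context)
```

In this sketch a score of 0 means the supplied context left the output unchanged, and larger values mean a larger shift. In practice an embedding-based similarity would likely replace the token overlap used here.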
Cite
Text
Fotouhi et al. "Fast Training Dataset Attribution via In-Context Learning." ICML 2024 Workshops: ICL, 2024.
Markdown
[Fotouhi et al. "Fast Training Dataset Attribution via In-Context Learning." ICML 2024 Workshops: ICL, 2024.](https://mlanthology.org/icmlw/2024/fotouhi2024icmlw-fast/)
BibTeX
@inproceedings{fotouhi2024icmlw-fast,
title = {{Fast Training Dataset Attribution via In-Context Learning}},
author = {Fotouhi, Milad and Bahadori, Mohammad Taha and Feyisetan, Seyi and Arabshahi, Payman and Heckerman, David},
booktitle = {ICML 2024 Workshops: ICL},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/fotouhi2024icmlw-fast/}
}