KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Lai, Eugenie; Vitagliano, Gerardo; Zhang, Ziyu; Chabra, Om; Sudhir, Sivaprasad; Zeng, Anna; Zabreyko, Anton A.; Li, Chenning; Kossmann, Ferdi; Ding, Jialin; Chen, Jun; Markakis, Markos; Russo, Matthew; Wang, Weiyang; Wu, Ziniu; Cafarella, Mike; Cao, Lei; Madden, Samuel; Kraska, Tim

ICLR 2026

Abstract

Discovering insights from a real-world data lake, which may contain unclean, semi-structured, and unstructured data, requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their ability to design and execute complex pipelines that solve these data-lake-to-insight challenges remains unclear. We introduce KramaBench, which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench tests the end-to-end capabilities of AI systems on challenges that require automated orchestration of different data tasks. It also features a comprehensive evaluation framework that assesses the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework DS-Guru, alongside both open- and closed-source single- and multi-agent systems, and find that while current agentic systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle to produce working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can fully implement only 20% of individual data tasks.

Cite

Text

Lai et al. "KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes." International Conference on Learning Representations, 2026.

Markdown

[Lai et al. "KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lai2026iclr-kramabench/)

BibTeX

@inproceedings{lai2026iclr-kramabench,
  title     = {{KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes}},
  author    = {Lai, Eugenie and Vitagliano, Gerardo and Zhang, Ziyu and Chabra, Om and Sudhir, Sivaprasad and Zeng, Anna and Zabreyko, Anton A. and Li, Chenning and Kossmann, Ferdi and Ding, Jialin and Chen, Jun and Markakis, Markos and Russo, Matthew and Wang, Weiyang and Wu, Ziniu and Cafarella, Mike and Cao, Lei and Madden, Samuel and Kraska, Tim},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lai2026iclr-kramabench/}
}