Themisto: Jupyter-Based Runtime Benchmark

Abstract

We present a benchmark consisting of Jupyter notebook development trajectories that measures how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied area in the development of code-based models.

Cite

Text

Grotov and Titov. "Themisto: Jupyter-Based Runtime Benchmark." ICLR 2025 Workshops: DL4C, 2025.

Markdown

[Grotov and Titov. "Themisto: Jupyter-Based Runtime Benchmark." ICLR 2025 Workshops: DL4C, 2025.](https://mlanthology.org/iclrw/2025/grotov2025iclrw-themisto/)

BibTeX

@inproceedings{grotov2025iclrw-themisto,
  title     = {{Themisto: Jupyter-Based Runtime Benchmark}},
  author    = {Grotov, Konstantin and Titov, Sergey},
  booktitle = {ICLR 2025 Workshops: DL4C},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/grotov2025iclrw-themisto/}
}