TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Abstract

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
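The abstract describes `TypeSim` only at a high level, as a metric that gives credit for semantic closeness between predicted and ground truth types rather than requiring exact matches. As a rough illustration of the kind of structural comparison such a metric performs, the toy sketch below scores partial agreement between nested annotation strings; the parser, the 0.5/0.5 weighting, and the function names are our own assumptions for illustration, not TypyBench's implementation.

```python
# Hypothetical sketch, not TypyBench's TypeSim: a toy structural similarity over
# type annotations that gives partial credit when nested types agree only partly,
# e.g. list[int] vs. list[str] scores higher than list[int] vs. dict[str, int].

def parse_type(annotation: str) -> tuple[str, list]:
    """Parse 'dict[str, list[int]]' into a (head, [argument trees]) tuple."""
    annotation = annotation.strip()
    if "[" not in annotation:
        return (annotation, [])
    head, rest = annotation.split("[", 1)
    inner = rest.rsplit("]", 1)[0]
    # Split the argument list on top-level commas only.
    args, depth, current = [], 0, ""
    for ch in inner:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current)
            current = ""
        else:
            current += ch
    if current.strip():
        args.append(current)
    return (head.strip(), [parse_type(a) for a in args])

def _tree_similarity(pred: tuple, truth: tuple) -> float:
    """Blend the match on the outer constructor with recursive argument scores."""
    p_head, p_args = pred
    t_head, t_args = truth
    head_score = 1.0 if p_head == t_head else 0.0
    if not p_args and not t_args:
        return head_score
    n = max(len(p_args), len(t_args))
    arg_score = sum(_tree_similarity(p, t) for p, t in zip(p_args, t_args)) / n
    return 0.5 * head_score + 0.5 * arg_score

def type_similarity(pred: str, truth: str) -> float:
    """Score two annotation strings in [0, 1]; 1.0 is an exact structural match."""
    return _tree_similarity(parse_type(pred), parse_type(truth))

if __name__ == "__main__":
    print(type_similarity("list[int]", "list[int]"))        # 1.0
    print(type_similarity("list[int]", "list[str]"))        # 0.5
    print(type_similarity("list[int]", "dict[str, int]"))   # 0.0
```

The actual metric is presumably richer, since the abstract emphasizes semantic relationships (e.g. compatible or related types) rather than purely syntactic overlap; this sketch only conveys why graded similarity is more informative than exact-match accuracy for nested types.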

Cite

Text

Jiang et al. "TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories." ICLR 2025 Workshops: DL4C, 2025.

Markdown

[Jiang et al. "TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories." ICLR 2025 Workshops: DL4C, 2025.](https://mlanthology.org/iclrw/2025/jiang2025iclrw-typybench/)

BibTeX

@inproceedings{jiang2025iclrw-typybench,
  title     = {{TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories}},
  author    = {Jiang, Yuhe and Deng, Xun and Yang, Jiacheng and Dong, Honghua and Pekhimenko, Gennady and Long, Fan and Si, Xujie},
  booktitle = {ICLR 2025 Workshops: DL4C},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/jiang2025iclrw-typybench/}
}