BIRD-INTERACT: Re-Imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions
Abstract
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity, either by treating conversation histories as static context or by limiting evaluation to narrow, read-only (SELECT-only) operations, and thus may fail to reflect the challenges encountered by production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a **comprehensive interaction environment** that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two **evaluation settings** that reflect real-world usage: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the DB environment; (3) a **challenging task suite** that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (BIRD-INTERACT-FULL) of 600 tasks, which unfold into up to 11,796 dynamic interactions and give a comprehensive overview of performance, and a lite set (BIRD-INTERACT-LITE) of 300 tasks with simplified databases, intended for detailed behavioral analysis of interactions and fast development of methods.
Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model GPT-5 completes only 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. Further analysis via memory grafting and Interaction Test-time Scaling (ITS) validates the importance of effective interaction for achieving success in dynamic text-to-SQL tasks.
Cite
Text
Huo et al. "BIRD-INTERACT: Re-Imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions." International Conference on Learning Representations, 2026.
Markdown
[Huo et al. "BIRD-INTERACT: Re-Imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huo2026iclr-birdinteract/)
BibTeX
@inproceedings{huo2026iclr-birdinteract,
title = {{BIRD-INTERACT: Re-Imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions}},
author = {Huo, Nan and Xu, Xiaohan and Li, Jinyang and Jacobsson, Per and Lin, Shipei and Qin, Bowen and Hui, Binyuan and Li, Xiaolong and Qu, Ge and Si, Shuzheng and Han, Linheng and Alexander, Edward and Zhu, Xintong and Qin, Rui and Yu, Ruihan and Jin, Yiyao and Zhou, Feige and Zhong, Weihao and Chen, Yun and Liu, Hongyu and Ma, Chenhao and Ozcan, Fatma and Papakonstantinou, Yannis and Cheng, Reynold},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/huo2026iclr-birdinteract/}
}