OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

Jacob, Darryl C.; Liu, Xinyu; Ye, Muchao; Yuan, Xiaoyong; He, Pan

TMLR 2026

Abstract

Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning–based TSC methods function as black boxes, providing little to no insight into their decisions. Although large language models (LLMs) could provide the needed interpretability through natural language reasoning, they face challenges such as limited memory and difficulty in deriving optimal policies from sparse environmental feedback. Existing TSC methods that apply reinforcement fine-tuning to LLMs suffer from notable training instability and deliver only limited improvements over pretrained models. We attribute this instability to the long-horizon nature of TSC: feedback is sparse and delayed, most control actions yield only marginal changes in congestion metrics, and the resulting weak reward signals interact poorly with policy-gradient optimization. We introduce OracleTSC, which addresses these issues through two components: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental feedback, and (2) an uncertainty regularizer that prevents policy degeneracy by maximizing the probability of the chosen answer, which promotes consistent decision-making across multiple responses. Experiments on the standard LibSignal benchmark demonstrate that our approach enables a compact model (LLaMA3-8B) to achieve substantial improvements in traffic flow, with a $75\%$ reduction in travel time and a $67\%$ decrease in queue lengths over the pretrained baseline, while preserving interpretability through natural language explanations. Furthermore, the method exhibits strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally distinct intersection with $17\%$ lower travel time and $39\%$ lower queue length, without any additional fine-tuning for the target topology. These findings show that uncertainty-aware reward shaping could stabilize reinforcement fine-tuning and provide a new perspective for improving its effectiveness in TSC tasks.
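
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed `hurdle` value, the use of majority frequency among sampled answers as a stand-in for "the probability of the chosen answer", and the `weight` coefficient are all assumptions made here for illustration.

```python
from collections import Counter
import math


def hurdle_reward(env_reward: float, hurdle: float) -> float:
    """Subtract a threshold from the raw environment reward.

    Rewards at or below the hurdle become non-positive, so marginal
    changes in congestion no longer produce a positive learning signal.
    The hurdle value itself is a hypothetical constant here.
    """
    return env_reward - hurdle


def chosen_answer_probability(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and its empirical frequency.

    Assumption: agreement among several sampled responses is used as a
    simple proxy for the probability of the chosen answer.
    """
    counts = Counter(answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)


def shaped_reward(env_reward: float, hurdle: float,
                  answers: list[str], weight: float = 0.1) -> float:
    """Combine the hurdled reward with a consistency bonus.

    The bonus weight * log(p) is zero when all samples agree and
    increasingly negative as the samples disagree (assumed form).
    """
    _, p = chosen_answer_probability(answers)
    return hurdle_reward(env_reward, hurdle) + weight * math.log(p)


if __name__ == "__main__":
    # Hypothetical signal-phase answers sampled for one decision step.
    samples = ["NS-green", "NS-green", "EW-green", "NS-green"]
    print(chosen_answer_probability(samples))
    print(shaped_reward(env_reward=1.0, hurdle=0.25, answers=samples))
```

In this sketch a step whose raw reward does not clear the hurdle contributes a non-positive signal, and disagreement among sampled responses lowers the shaped reward further, which mirrors the filtering and consistency roles the abstract describes.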

Cite

Text

Jacob et al. "OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control." Transactions on Machine Learning Research, 2026.

Markdown

[Jacob et al. "OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/jacob2026tmlr-oracletsc/)

BibTeX

@article{jacob2026tmlr-oracletsc,
  title     = {{OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control}},
  author    = {Jacob, Darryl C. and Liu, Xinyu and Ye, Muchao and Yuan, Xiaoyong and He, Pan},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/jacob2026tmlr-oracletsc/}
}