Is Isotropy a Good Proxy for Generalization in Time Series Forecasting with Transformers?
Abstract
Vector representations of contextual embeddings learned by transformer-based models have been shown to be effective even for downstream tasks in \emph{numerical domains} such as time series forecasting. The success of these models in capturing long-range dependencies and contextual semantics has led to broad adoption across architectures, yet there is little theoretical understanding of when transformers, whether autoregressive or non-autoregressive, generalize well to forecasting tasks. This paper addresses this gap through an analysis of isotropy in the contextual embedding space. Specifically, we adopt a log-linear model as a simplified abstraction of the hidden representations in transformer-based models. In this formulation, time series embeddings are mapped to predictive outputs through a softmax layer, providing a tractable lens for analyzing generalization. We show that state-of-the-art performance requires embeddings to possess a structure that accounts for the shift-invariance of the softmax function. By examining the gradient structure of self-attention, we demonstrate how isotropy preserves representation structure, resolves the shift-invariance problem, and provides insights into model reliability and generalization. Experiments across $22$ numerical datasets and $5$ transformer-based models show that data characteristics and architectural choices significantly affect isotropy, which in turn directly influences forecasting performance. This establishes isotropy as a theoretically grounded and empirically validated indicator of generalization and reliability in time series forecasting. The code for the isotropy analysis and all data are publicly available.
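The abstract leans on two technical facts: the shift-invariance of the softmax function and a scalar notion of isotropy for the embedding space. The snippet below is a minimal sketch, not the paper's metric or code: it checks the shift-invariance property and computes a commonly used partition-function isotropy score (in the style of Mu and Viswanath's measure), where $Z(c)=\sum_i \exp(c \cdot w_i)$ is probed at the eigenvectors of $W^\top W$ and the score is $\min_c Z(c)/\max_c Z(c)$. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def isotropy_score(W):
    """Partition-function isotropy score of an embedding matrix W (n x d).

    Z(c) = sum_i exp(c . w_i) is evaluated at the unit eigenvectors of
    W^T W; the score min Z / max Z is close to 1 for isotropic embeddings
    and close to 0 when the embeddings collapse into a narrow cone.
    """
    _, eigvecs = np.linalg.eigh(W.T @ W)   # columns are unit eigenvectors
    Z = np.exp(W @ eigvecs).sum(axis=0)    # Z(c) for each probe direction c
    return float(Z.min() / Z.max())

rng = np.random.default_rng(0)

# Softmax is shift-invariant: adding a constant to every logit leaves the
# output unchanged, so the embedding-to-output map cannot distinguish
# representations that differ by a common shift.
z = rng.normal(size=5)
print(np.allclose(softmax(z), softmax(z + 3.0)))   # True

# Roughly isotropic embeddings score close to 1; adding a large common
# offset (a dominant mean direction) drives the score toward 0.
W_iso = rng.normal(size=(1000, 64))
print(f"isotropic:   {isotropy_score(W_iso):.3f}")
print(f"anisotropic: {isotropy_score(W_iso + 5.0):.3f}")
```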
Cite
Text
Shelim et al. "Is Isotropy a Good Proxy for Generalization in Time Series Forecasting with Transformers?" Transactions on Machine Learning Research, 2025.
Markdown
[Shelim et al. "Is Isotropy a Good Proxy for Generalization in Time Series Forecasting with Transformers?" Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/shelim2025tmlr-isotropy/)
BibTeX
@article{shelim2025tmlr-isotropy,
  title   = {{Is Isotropy a Good Proxy for Generalization in Time Series Forecasting with Transformers?}},
  author  = {Shelim, Rashed and Xu, Shengzhe and Saad, Walid and Ramakrishnan, Naren},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/shelim2025tmlr-isotropy/}
}