On Learning Linear Dynamical Systems in Context with Attention Layers
Abstract
This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian. Importantly, this non-i.i.d. data setting is a significant step towards modeling real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent relative to an autoregression objective of window size one. Guided by experiments, we uncover a connection to a generalization of the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers’ expressivity as in-context learners and offer plausible hypotheses for recent observations that place their performance on par with that of the Kalman Filter — the optimal model-dependent learner for this setting.
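As a rough illustration of the kind of equivalence the abstract refers to, the sketch below checks numerically that one gradient descent step, started from zero, on a least-squares autoregression objective of window size one gives the same next-state prediction as an unnormalized linear-attention readout. This is the generic textbook identity, not the paper's optimal weight construction. The function names, the step size `eta`, and the simplification to process noise only (no separate observation noise) are assumptions made for the example.

```python
import numpy as np


def simulate_lds(A, T, noise_std, rng):
    """Roll out x[t+1] = A x[t] + Gaussian noise for T steps (hypothetical helper)."""
    d = A.shape[0]
    x = np.zeros((T + 1, d))
    x[0] = rng.standard_normal(d)
    for t in range(T):
        x[t + 1] = A @ x[t] + noise_std * rng.standard_normal(d)
    return x


def gd_one_step_prediction(x, eta):
    """Predict the next state after one GD step from W = 0 on
    L(W) = 0.5 * sum_t ||x[t+1] - W x[t]||^2 (window size one)."""
    prev, nxt = x[:-1], x[1:]
    grad_at_zero = -(nxt.T @ prev)   # gradient of L evaluated at W = 0
    W1 = -eta * grad_at_zero         # single gradient descent step
    return W1 @ x[-1]


def linear_attention_prediction(x, eta):
    """Same prediction written as linear attention without softmax:
    keys are x[t], values are x[t+1], and the query is the last state."""
    keys, values, query = x[:-1], x[1:], x[-1]
    scores = keys @ query            # inner products <x[t], x[T]>
    return eta * (values.T @ scores)


rng = np.random.default_rng(0)
A = 0.9 * np.eye(3)                  # assumed stable transition matrix
x = simulate_lds(A, T=50, noise_std=0.1, rng=rng)
pred_gd = gd_one_step_prediction(x, eta=0.01)
pred_attn = linear_attention_prediction(x, eta=0.01)
print(np.allclose(pred_gd, pred_attn))
```

The two functions agree because `W1 @ x[-1]` expands to `eta * sum_t x[t+1] * <x[t], x[-1]>`, which is exactly the attention-style weighted sum of values.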
Cite
Text
Vladarean et al. "On Learning Linear Dynamical Systems in Context with Attention Layers." International Conference on Learning Representations, 2026.
Markdown
[Vladarean et al. "On Learning Linear Dynamical Systems in Context with Attention Layers." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/vladarean2026iclr-learning/)
BibTeX
@inproceedings{vladarean2026iclr-learning,
title = {{On Learning Linear Dynamical Systems in Context with Attention Layers}},
author = {Vladarean, Maria-Luiza and Zhang, Xuhui and Sra, Suvrit},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/vladarean2026iclr-learning/}
}