TS-Attn: Temporal-Wise Separable Attention for Multi-Event Video Generation

Zhang, Hongyu; Deng, Yufan; Pan, Zilin; Jiang, Peng-Tao; Li, Bo; Hou, Qibin; Dong, Zhen; Dou, Zhiyang; Zhou, Daquan

ICLR 2026

Abstract

Generating high-quality videos from complex temporal descriptions, i.e., prompts containing multiple sequential actions, remains a significant challenge. Existing methods are constrained by an inherent trade-off: feeding multiple short prompts into the model sequentially improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: temporal misalignment between video content and the prompt, and conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges the attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and video demos are available in the supplementary materials.
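
The abstract does not spell out how TS-Attn rearranges attention, so the following is not the authors' algorithm. It is a minimal sketch of a simpler, related idea, hard temporal masking of cross-attention, which may help illustrate what "temporal alignment between video content and the prompt" could mean in practice. Every name here (`event_for_frame`, `masked_softmax`, `temporal_cross_attention`), the equal-length split of frames into events, and the convention that a token tagged `None` is shared by all events are assumptions made for illustration only.

```python
import math

def event_for_frame(frame_idx, num_frames, num_events):
    """Assign a frame to an event by splitting the clip into equal temporal segments.
    (Assumption: events occupy equal-length, consecutive spans of the video.)"""
    return min(frame_idx * num_events // num_frames, num_events - 1)

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; masked-out positions get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one position")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_cross_attention(scores, token_events, num_events):
    """Return per-frame attention weights over text tokens.

    scores[f][t]    : raw attention logit of frame f to text token t
    token_events[t] : event index of token t, or None for tokens shared by all events
    """
    num_frames = len(scores)
    weights = []
    for f, row in enumerate(scores):
        event = event_for_frame(f, num_frames, num_events)
        # A frame may attend to shared tokens and to tokens of its own event only.
        mask = [e is None or e == event for e in token_events]
        weights.append(masked_softmax(row, mask))
    return weights

# Hypothetical toy input: 4 frames, 2 events, 3 text tokens
# (token 0 is shared, token 1 describes event 0, token 2 describes event 1).
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
weights = temporal_cross_attention(logits, [None, 0, 1], num_events=2)
print(weights[0])  # first frame ignores the event-1 token
print(weights[3])  # last frame ignores the event-0 token
```

In this toy setup, early frames put no weight on the text of the second event and late frames put none on the first, while shared tokens stay visible throughout. A hard mask like this is only one way to separate events in time; the paper describes a dynamic rearrangement of the attention distribution, which this sketch does not attempt to reproduce.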

Cite

Text

Zhang et al. "TS-Attn: Temporal-Wise Separable Attention for Multi-Event Video Generation." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "TS-Attn: Temporal-Wise Separable Attention for Multi-Event Video Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-tsattn/)

BibTeX

@inproceedings{zhang2026iclr-tsattn,
  title     = {{TS-Attn: Temporal-Wise Separable Attention for Multi-Event Video Generation}},
  author    = {Zhang, Hongyu and Deng, Yufan and Pan, Zilin and Jiang, Peng-Tao and Li, Bo and Hou, Qibin and Dong, Zhen and Dou, Zhiyang and Zhou, Daquan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-tsattn/}
}