Wen, Kaiyue

16 publications

ICLR 2025. From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency. Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang.
NeurIPS 2025. Gated Attention for Large Language Models: Non-Linearity, Sparsity, and Attention-Sink-Free. Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin.
ICML 2025. Overtrained Language Models Are Harder to Fine-Tune. Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan.
ICLRW 2025. Overtrained Language Models Are Harder to Fine-Tune. Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan.
NeurIPS 2025. PaTH Attention: Position Encoding via Accumulating Householder Transformations. Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim.
ICLR 2025. RNNs Are Not Transformers (Yet): The Key Bottleneck on In-Context Retrieval. Kaiyue Wen, Xingyu Dang, Kaifeng Lyu.
ICML 2025. Task Generalization with Autoregressive Compositional Structure: Can Learning from $D$ Tasks Generalize to $D^T$ Tasks? Amirhesam Abedsoltan, Huaqing Zhang, Kaiyue Wen, Hongzhou Lin, Jingzhao Zhang, Mikhail Belkin.
ICLR 2025. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View. Kaiyue Wen, Zhiyuan Li, Jason S. Wang, David Leo Wright Hall, Percy Liang, Tengyu Ma.
NeurIPSW 2024. From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency. Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang.
ICLRW 2024. On the Representation Gap Between Modern RNNs and Transformers: The Curse of Memory Efficiency and the Fix of In-Context Retrieval. Kaiyue Wen, Xingyu Dang, Kaifeng Lyu.
ICMLW 2023. (Un)interpretability of Transformers: A Case Study with Dyck Grammars. Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski.
ICLR 2023. Benign Overfitting in Classification: Provably Counter Label Noise with Larger Models. Kaiyue Wen, Jiaye Teng, Jingzhao Zhang.
ICLR 2023. How Does Sharpness-Aware Minimization Minimize Sharpness? Kaiyue Wen, Tengyu Ma, Zhiyuan Li.
NeurIPS 2023. Sharpness Minimization Algorithms Do Not Only Minimize Sharpness to Achieve Better Generalization. Kaiyue Wen, Zhiyuan Li, Tengyu Ma.
NeurIPS 2023. Transformers Are Uninterpretable with Myopic Methods: A Case Study with Bounded Dyck Grammars. Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski.
NeurIPSW 2022. How Does Sharpness-Aware Minimization Minimize Sharpness? Kaiyue Wen, Tengyu Ma, Zhiyuan Li.