Wen, Kaiyue (16 publications)

- Gated Attention for Large Language Models: Non-Linearity, Sparsity, and Attention-Sink-Free. NeurIPS 2025.
- From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency. NeurIPS Workshop 2024.
- Sharpness Minimization Algorithms Do Not Only Minimize Sharpness to Achieve Better Generalization. NeurIPS 2023.