Suzuki, Taiji
142 publications
Degrees of Freedom for Linear Attention: Distilling SoftMax Attention with Optimal Feature Efficiency. NeurIPS 2025.
From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers. NeurIPS 2025.
Generalization Bound of Gradient Flow Through Training Trajectory and Data-Dependent Kernel. NeurIPS 2025.
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation. ICML 2025.
Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning. ICML 2025.
On the Optimization and Generalization of Two-Layer Transformers with Sign Gradient Descent. ICLR 2025.
Propagation of Chaos for Mean-Field Langevin Dynamics and Its Application to Model Ensemble. ICML 2025.
Quantifying the Optimization and Generalization Advantages of Graph Neural Networks over Multilayer Perceptrons. AISTATS 2025.
Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric. ICLR 2025.
High-Dimensional Kernel Methods Under Covariate Shift: Data-Dependent Implicit Regularization. ICML 2024.
Mean Field Langevin Actor-Critic: Faster Convergence and Global Optimality Beyond Lazy Learning. ICML 2024.
Neural Network Learns Low-Dimensional Polynomials with SGD near the Information-Theoretic Limit. ICMLW 2024.
Neural Network Learns Low-Dimensional Polynomials with SGD near the Information-Theoretic Limit. NeurIPS 2024.
Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression. NeurIPSW 2024.
Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning. NeurIPS 2024.
State Space Models Are Comparable to Transformers in Estimating Functions with Dynamic Smoothness. ICMLW 2024.
Understanding Convergence and Generalization in Federated Learning Through Feature Learning Theory. ICLR 2024.
Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. NeurIPS 2024.
Convergence of Mean-Field Langevin Dynamics: Time-Space Discretization, Stochastic Gradient, and Variance Reduction. NeurIPS 2023.
DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning. ICML 2023.
Feature Learning via Mean-Field Langevin Dynamics: Classifying Sparse Parities and Beyond. NeurIPS 2023.
Graph Neural Networks Benefit from Structural Information Provably: A Feature Learning Perspective. NeurIPSW 2023.
Learning in the Presence of Low-Dimensional Structure: A Spiked Random Matrix Perspective. NeurIPS 2023.
High-Dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation. NeurIPS 2022.
Reducing Communication in Nonconvex Federated Learning with a Novel Single-Loop Variance Reduction Method. NeurIPSW 2022.
Two-Layer Neural Network on Infinite Dimensional Data: Global Optimization Guarantee in the Mean-Field Regime. NeurIPS 2022.
Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. AISTATS 2021.
Deep Learning Is Adaptive to Intrinsic Dimensionality of Model Smoothness in Anisotropic Besov Space. NeurIPS 2021.
On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting. ICML 2021.
Particle Dual Averaging: Optimization of Mean Field Neural Network with Global Convergence Rate Analysis. NeurIPS 2021.
Functional Gradient Boosting for Learning Residual-like Networks with Statistical Guarantees. AISTATS 2020.
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors. AISTATS 2019.
Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models. AISTATS 2018.
Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables. AISTATS 2018.
Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization. NeurIPS 2017.
Stochastic Difference of Convex Algorithm and Its Application to Training Deep Boltzmann Machines. AISTATS 2017.
Computational Complexity of Kernel-Based Density-Ratio Estimation: A Condition Number Analysis. MLJ 2013.
Dual Averaging and Proximal Gradient Descent for Online Alternating Direction Multiplier Method. ICML 2013.
A Conjugate Property Between Loss Functions and Uncertainty Sets in Classification Problems. COLT 2012.
Fast Learning Rate of Multiple Kernel Learning: Trade-Off Between Sparsity and Smoothness. AISTATS 2012.