Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View.

Abstract

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a \emph{numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system}. In particular, the way words in a sentence are abstracted into contexts as they pass through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Inspired by this relationship, we propose to replace the Lie-Trotter splitting scheme with the more accurate Strang-Marchuk splitting scheme and design a new network architecture called Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks.
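The gap between the two splitting schemes the abstract refers to can be seen on a toy ODE. The sketch below (not from the paper; the split $F(x) = x^2$, $G(x) = -x$ and the step sizes are illustrative choices) integrates $x' = F(x) + G(x)$ by composing the exact sub-flows of $F$ and $G$: the Lie-Trotter scheme applies each flow once per step (first-order accurate), while the Strang scheme sandwiches a full step of one flow between two half steps of the other (second-order accurate), which is the ordering pattern the Macaron Net mirrors with its half-step feed-forward sublayers around attention.

```python
import math

# Toy split ODE: x'(t) = F(x) + G(x) with non-commuting parts.
# F(x) = x**2 has exact flow  x -> x / (1 - h*x)
# G(x) = -x   has exact flow  x -> x * exp(-h)
# These choices are illustrative, not taken from the paper.

def flow_F(x, h):
    return x / (1.0 - h * x)

def flow_G(x, h):
    return x * math.exp(-h)

def lie_trotter_step(x, h):
    # First-order splitting: one full step of G, then one full step of F.
    return flow_F(flow_G(x, h), h)

def strang_step(x, h):
    # Second-order splitting: half step of G, full step of F, half step of G.
    x = flow_G(x, h / 2)
    x = flow_F(x, h)
    return flow_G(x, h / 2)

def integrate(step, x0, t_end, n_steps):
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = step(x, h)
    return x

def exact(x0, t):
    # x' = x**2 - x is a (time-reversed) logistic equation with closed form.
    return x0 * math.exp(-t) / (1.0 - x0 + x0 * math.exp(-t))

x0, T, n = 0.5, 1.0, 100
err_lt = abs(integrate(lie_trotter_step, x0, T, n) - exact(x0, T))
err_strang = abs(integrate(strang_step, x0, T, n) - exact(x0, T))
# The Strang error should be markedly smaller: O(h^2) versus O(h).
```

Halving the step size should roughly halve the Lie-Trotter error but quarter the Strang error, which is the accuracy argument behind preferring the Strang-Marchuk ordering.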

Cite

Text

Lu et al. "Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View." ICLR 2020 Workshops: DeepDiffEq, 2020.

Markdown

[Lu et al. "Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View." ICLR 2020 Workshops: DeepDiffEq, 2020.](https://mlanthology.org/iclrw/2020/lu2020iclrw-understanding/)

BibTeX

@inproceedings{lu2020iclrw-understanding,
  title     = {{Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View.}},
  author    = {Lu, Yiping and Li, Zhuohan and He, Di and Sun, Zhiqing and Dong, Bin and Qin, Tao and Wang, Liwei and Liu, Tie-Yan},
  booktitle = {ICLR 2020 Workshops: DeepDiffEq},
  year      = {2020},
  url       = {https://mlanthology.org/iclrw/2020/lu2020iclrw-understanding/}
}