Low-Width Approximations and Sparsification for Scaling Graph Transformers

Abstract

Graph Transformers have shown excellent results on a diverse set of datasets. However, memory limitations prevent these models from scaling to larger graphs. With standard single-GPU setups, even training on medium-sized graphs is impossible for most Graph Transformers. While the $\mathcal{O}(nd^2 + n^2 d)$ complexity of each layer can be reduced to $\mathcal{O}((n+m)d + nd^2)$ for graphs with $n$ nodes and $m$ edges using sparse attention models such as Exphormer, these models are still infeasible to train on small-memory devices, even for medium-sized datasets. Here, we propose to sparsify the Exphormer model even further by using a small "pilot" network to estimate attention scores along the graph edges, then training a larger model using only the $\mathcal{O}(n)$ edges deemed important by the small network. We show empirically that attention scores from smaller networks provide a good estimate of the attention scores in larger networks, and that this process can yield a large-width sparse model nearly as good as the large-width non-sparse model.
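The core idea described above, scoring edges with a low-width pilot projection and then keeping only $\mathcal{O}(n)$ of them for the larger model, can be sketched roughly as follows. This is a minimal illustration under assumed details, not the paper's implementation: the function names, the per-node top-$k$ selection rule, and the PyTorch-style tensors are all assumptions.

```python
import torch

def edge_attention_scores(x, edge_index, W_q, W_k):
    """Score each directed edge (u -> v) with the scaled dot product of
    v's query and u's key, using small (low-width) pilot projections."""
    q = x @ W_q                     # [n, d_small] queries
    k = x @ W_k                     # [n, d_small] keys
    src, dst = edge_index           # each of shape [m]
    return (q[dst] * k[src]).sum(dim=-1) / (q.shape[-1] ** 0.5)

def sparsify_edges(edge_index, scores, keep_per_node=4):
    """Keep roughly O(n) edges: the top-scoring incoming edges of each node
    (a simple assumed selection rule for illustration)."""
    dst = edge_index[1]
    keep = torch.zeros(scores.numel(), dtype=torch.bool)
    for v in torch.unique(dst):
        idx = (dst == v).nonzero(as_tuple=True)[0]
        top = idx[scores[idx].topk(min(keep_per_node, idx.numel())).indices]
        keep[top] = True
    return edge_index[:, keep]

# Toy usage: random features and edges stand in for a real graph.
n, m, d_in, d_small = 1000, 8000, 64, 16
x = torch.randn(n, d_in)
edge_index = torch.randint(0, n, (2, m))
W_q, W_k = torch.randn(d_in, d_small), torch.randn(d_in, d_small)
scores = edge_attention_scores(x, edge_index, W_q, W_k)
pruned_edge_index = sparsify_edges(edge_index, scores)  # used to train the wide sparse model
```

In practice the pilot scores would come from a trained small-width network rather than random projections; the point of the sketch is only the two-stage flow of scoring edges cheaply and then training the wide model on the pruned edge set.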

Cite

Text

Shirzad et al. "Low-Width Approximations and Sparsification for Scaling Graph Transformers." NeurIPS 2023 Workshops: GLFrontiers, 2023.

Markdown

[Shirzad et al. "Low-Width Approximations and Sparsification for Scaling Graph Transformers." NeurIPS 2023 Workshops: GLFrontiers, 2023.](https://mlanthology.org/neuripsw/2023/shirzad2023neuripsw-lowwidth/)

BibTeX

@inproceedings{shirzad2023neuripsw-lowwidth,
  title     = {{Low-Width Approximations and Sparsification for Scaling Graph Transformers}},
  author    = {Shirzad, Hamed and Venkatachalam, Balaji and Velingker, Ameya and Sutherland, Danica and Woodruff, David},
  booktitle = {NeurIPS 2023 Workshops: GLFrontiers},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/shirzad2023neuripsw-lowwidth/}
}