CoTFormer: More Tokens with Attention Make up for Less Depth

Abstract

The race to develop ever larger and deeper foundation models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this study, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve performance comparable to that of a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers: they significantly outperform larger standard transformers.
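To make the idea concrete, below is a minimal PyTorch sketch of one way an implicit CoT-like mechanism could be realized: a single shared block is applied repeatedly, and each pass attends over the token states produced by earlier passes, which are exposed as extra tokens in the sequence. The names (SharedBlock, cot_like_forward, n_repeats) and the exact arrangement of the repeated tokens are illustrative assumptions, not the architecture described in the paper.

import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    # One pre-norm attention + MLP block; the same weights are reused on every pass.
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln_q = nn.LayerNorm(d_model)
        self.ln_kv = nn.LayerNorm(d_model)
        self.ln_mlp = nn.LayerNorm(d_model)

    def forward(self, current, memory):
        # The current token states attend over a longer "memory" sequence that
        # also contains the states produced by earlier passes.
        q = self.ln_q(current)
        kv = self.ln_kv(memory)
        h, _ = self.attn(q, kv, kv, need_weights=False)
        x = current + h
        return x + self.mlp(self.ln_mlp(x))

def cot_like_forward(x, block, n_repeats=3):
    # Apply the shared block n_repeats times. After each pass, the new token
    # states are appended as additional tokens that later passes can attend to,
    # trading extra tokens (with attention) for extra depth. This is a toy
    # simplification, not the paper's exact token arrangement.
    memory, current = x, x
    for _ in range(n_repeats):
        current = block(current, memory)
        memory = torch.cat([memory, current], dim=1)
    return current

if __name__ == "__main__":
    block = SharedBlock()
    tokens = torch.randn(2, 16, 64)      # (batch, sequence length, d_model)
    out = cot_like_forward(tokens, block, n_repeats=3)
    print(out.shape)                     # torch.Size([2, 16, 64])

The point of the sketch is the contrast with simply stacking more layers: depth is replaced by repeated application of the same block over a sequence that grows with intermediate states, analogous to how CoT lets a model attend to its own generated tokens.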

Cite

Text

Mohtashami et al. "CoTFormer: More Tokens with Attention Make up for Less Depth." NeurIPS 2023 Workshops: WANT, 2023.

Markdown

[Mohtashami et al. "CoTFormer: More Tokens with Attention Make up for Less Depth." NeurIPS 2023 Workshops: WANT, 2023.](https://mlanthology.org/neuripsw/2023/mohtashami2023neuripsw-cotformer/)

BibTeX

@inproceedings{mohtashami2023neuripsw-cotformer,
  title     = {{CoTFormer: More Tokens with Attention Make up for Less Depth}},
  author    = {Mohtashami, Amirkeivan and Pagliardini, Matteo and Jaggi, Martin},
  booktitle = {NeurIPS 2023 Workshops: WANT},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/mohtashami2023neuripsw-cotformer/}
}