CoTFormer: More Tokens with Attention Make up for Less Depth
Abstract
The race to continually develop ever larger and deeper foundational models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this study, we establish an approximate parallel between applying chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve performance comparable to that of a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, as they significantly outperform larger standard transformers.
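The abstract describes the mechanism only at a high level. The toy PyTorch snippet below is a minimal sketch of the stated intuition, trading extra depth for extra attended tokens by re-applying a single transformer block while letting each pass attend to the tokens produced by the previous pass. The class name ToyCoTFormerBlock, the n_repeats parameter, and the concatenation scheme are illustrative assumptions for exposition, not the CoTFormer architecture defined in the paper.

import torch
import torch.nn as nn

class ToyCoTFormerBlock(nn.Module):
    """Toy illustration: re-apply one block, exposing earlier passes via attention."""

    def __init__(self, d_model=64, n_heads=4, n_repeats=2):
        super().__init__()
        # A single shared layer stands in for what a deeper model would do with
        # several distinct layers.
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_repeats = n_repeats

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        context = x
        current = x
        for _ in range(self.n_repeats):
            # Each pass sees the original tokens plus all representations
            # produced by earlier passes, loosely analogous to a model
            # attending to its own chain-of-thought tokens.
            out = self.layer(context)
            current = out[:, -seq_len:, :]                   # updated original positions
            context = torch.cat([context, current], dim=1)   # expose them to the next pass
        return current

if __name__ == "__main__":
    block = ToyCoTFormerBlock()
    x = torch.randn(2, 8, 64)
    print(block(x).shape)  # torch.Size([2, 8, 64])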
Cite
Text
Mohtashami et al. "CoTFormer: More Tokens with Attention Make up for Less Depth." NeurIPS 2023 Workshops: WANT, 2023.
Markdown
[Mohtashami et al. "CoTFormer: More Tokens with Attention Make up for Less Depth." NeurIPS 2023 Workshops: WANT, 2023.](https://mlanthology.org/neuripsw/2023/mohtashami2023neuripsw-cotformer/)
BibTeX
@inproceedings{mohtashami2023neuripsw-cotformer,
title = {{CoTFormer: More Tokens with Attention Make up for Less Depth}},
author = {Mohtashami, Amirkeivan and Pagliardini, Matteo and Jaggi, Martin},
booktitle = {NeurIPS 2023 Workshops: WANT},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/mohtashami2023neuripsw-cotformer/}
}