Efficient Hardware Scaling and Diminishing Returns in Large-Scale Training of Language Models

Abstract

To train the exceedingly large neural networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g., GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e., compute- and cost-efficient) scaling of model training. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model sizes, hardware configurations, and distributed parallelization strategies, using current best practices. In experiments with model sizes up to 70B parameters and utilizing up to 2048 H100 GPUs, we demonstrate that: (1) naive scale-out with Fully Sharded Data Parallelism (FSDP) incurs communication overhead that causes parallelization strategies previously thought to be sub-optimal to in fact become preferable; and (2) scaling the total number of accelerators for training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
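
As a concrete illustration of the FSDP scale-out baseline the abstract refers to, the sketch below shows a minimal PyTorch training step wrapped with Fully Sharded Data Parallelism. This is not the paper's training code: the model, data, and hyperparameters are placeholders chosen only to show where FSDP's all-gather and reduce-scatter collectives, the source of the communication overhead discussed above, enter each step.

# Minimal FSDP sketch (placeholder model and data, not the paper's setup).
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=8 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder Transformer; a real LLM workload would be far larger.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16),
        num_layers=24,
    ).cuda()

    # FULL_SHARD splits parameters, gradients, and optimizer state across all
    # ranks; each forward/backward pass then issues all-gather and
    # reduce-scatter collectives, which is the communication cost that grows
    # as the number of accelerators is naively scaled out.
    model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for step in range(10):  # dummy loop over random data
        x = torch.randn(8, 128, 1024, device="cuda")
        loss = model(x).float().pow(2).mean()  # stand-in loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()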

Cite

Text

Fernandez et al. "Efficient Hardware Scaling and Diminishing Returns in Large-Scale Training of Language Models." Transactions on Machine Learning Research, 2025.

Markdown

[Fernandez et al. "Efficient Hardware Scaling and Diminishing Returns in Large-Scale Training of Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/fernandez2025tmlr-efficient/)

BibTeX

@article{fernandez2025tmlr-efficient,
  title     = {{Efficient Hardware Scaling and Diminishing Returns in Large-Scale Training of Language Models}},
  author    = {Fernandez, Jared and Wehrstedt, Luca and Shamis, Leonid and Elhoushi, Mostafa and Saladi, Kalyan and Bisk, Yonatan and Strubell, Emma and Kahn, Jacob},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/fernandez2025tmlr-efficient/}
}