TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads

Abstract

TensorHive is a tool for organizing the work of research and engineering teams that use GPU-equipped servers for machine learning workloads. Through a comprehensive web interface, it supports reserving GPUs for exclusive usage, monitoring hardware, and configuring, executing and queuing distributed computational jobs. Focused on easy installation and simple configuration, the tool automatically detects the available computing resources and monitors their utilization. Reservations, granted on the basis of flexible access control settings, are protected by pluggable violation hooks. The job execution module includes auto-configuration templates for distributed neural network training jobs in frameworks such as TensorFlow and PyTorch. Documentation, source code, usage examples and issue tracking are available at the project page: https://github.com/roscisz/TensorHive/
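Since the abstract emphasizes easy installation, a rough sketch of getting started may be helpful. The exact package name and CLI entry point should be verified against the project README; the commands below assume the tool is published on PyPI as `tensorhive` and launched via a `tensorhive` command.

```shell
# Install TensorHive into the current Python environment
# (assumes the package is published on PyPI as "tensorhive")
pip install tensorhive

# Start the service; per the abstract, it detects available computing
# resources and serves the web interface for reservations, monitoring
# and job execution
tensorhive
```

Host lists, access control settings and job templates are configured separately; see the documentation linked above for details.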

Cite

Text

Rościszewski et al. "TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads." Machine Learning Open Source Software, 2021.

Markdown

[Rościszewski et al. "TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads." Machine Learning Open Source Software, 2021.](https://mlanthology.org/mloss/2021/rosciszewski2021jmlr-tensorhive/)

BibTeX

@article{rosciszewski2021jmlr-tensorhive,
  title     = {{TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads}},
  author    = {Rościszewski, Paweł and Martyniak, Michał and Schodowski, Filip},
  journal   = {Machine Learning Open Source Software},
  year      = {2021},
  pages     = {1--5},
  volume    = {22},
  url       = {https://mlanthology.org/mloss/2021/rosciszewski2021jmlr-tensorhive/}
}