NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming

Abstract

Exploiting the formidable computational capabilities of modern GPU tensor cores remains a challenging endeavor for developers. Existing programming models such as CUDA and OpenCL are ill-suited to the non-SIMT nature of tensor cores, leaving a significant gap in the landscape of GPU programming languages. Vendors have primarily relied on library-based solutions or enhancements to mainstream machine learning frameworks, sacrificing the fine-grained control once afforded by CUDA in the SIMT era. In this paper, we introduce NVDSL, a Python-embedded domain-specific language built on the MLIR compiler infrastructure. NVDSL abstracts away the intricate details of tensor core programming. It allows programmers to efficiently target Hopper's warpgroups (groups of 128 threads, i.e., 4 warps), enabling users to express sophisticated algorithms, such as multistage pipelining and warp specialization, with remarkable simplicity. We demonstrate its efficacy through two optimized GEMM kernels that achieve cuBLAS-like performance with remarkable code clarity. NVDSL is publicly available in upstream MLIR. The work was presented at EuroLLVM 2024: https://www.youtube.com/watch?v=V3Q9IjsgXvA.
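The multistage GEMM kernel mentioned above tiles the K dimension and accumulates partial products per output tile, a loop structure the GPU kernel overlaps with asynchronous memory copies. As a plain-NumPy sketch of that tiling idea only (this is not NVDSL's actual API, which lives in upstream MLIR; `tiled_gemm` and the `tile` parameter are illustrative names), it might look like:

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    # Illustrative tiled GEMM: each (i, j) output tile accumulates
    # partial products over K-dimension tiles. On a GPU, a multistage
    # kernel pipelines the loads for iteration k+1 while computing k.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)  # per-tile accumulator
            for k in range(0, K, tile):
                acc += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
            C[i:i + tile, j:j + tile] = acc
    return C
```

In the actual kernels, the per-tile accumulator lives in registers and the matrix multiply is issued to tensor cores at warpgroup granularity rather than computed elementwise.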

Cite

Text

Ozen. "NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Ozen. "NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/ozen2024icmlw-nvdsl/)

BibTeX

@inproceedings{ozen2024icmlw-nvdsl,
  title     = {{NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming}},
  author    = {Ozen, Guray},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/ozen2024icmlw-nvdsl/}
}