NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming
Abstract
Exploiting the formidable computational capabilities of modern GPU tensor cores remains a challenging endeavor for developers. Existing programming models such as CUDA and OpenCL are ill-suited to the non-SIMT nature of tensor cores, leaving a significant gap in the landscape of GPU programming languages. Vendors have primarily relied on library-based solutions or enhancements to mainstream machine learning frameworks, sacrificing the fine-grained control that CUDA afforded in the SIMT era. In this paper, we introduce NVDSL, a Python-embedded domain-specific language built on the MLIR compiler infrastructure. NVDSL abstracts away the intricate details of tensor core programming: it lets programmers efficiently target Hopper's warpgroups (128 threads, i.e., 4 warps), enabling them to express sophisticated algorithms, such as multistage pipelining and warp specialization, with remarkable simplicity. We demonstrate its efficacy through two optimized GEMM kernels that achieve cuBLAS-like performance with remarkable code clarity. NVDSL is publicly available in upstream MLIR. The work was presented at EuroLLVM 2024: https://www.youtube.com/watch?v=V3Q9IjsgXvA.
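To make the "Python-driven metaprogramming" idea concrete, the sketch below shows the general staging technique such a DSL relies on: a decorator runs the Python body once over symbolic values and records IR-like operations instead of executing arithmetic. All names here (`kernel`, `Builder`, `Value`) are hypothetical illustrations of the concept, not NVDSL's actual API; a real MLIR-backed DSL would emit MLIR operations rather than strings.

```python
# Illustrative sketch of Python-driven IR metaprogramming (NOT NVDSL's real
# API): operator overloading on symbolic values stages a Python function
# into a list of textual IR ops, mirroring how an MLIR-embedded DSL builds IR.
class Value:
    """A symbolic SSA value produced while tracing the Python function."""
    def __init__(self, builder, name):
        self.builder, self.name = builder, name

    def __matmul__(self, other):          # a @ b  -> emit a matmul op
        return self.builder.emit("matmul", self, other)

    def __add__(self, other):             # a + b  -> emit an add op
        return self.builder.emit("add", self, other)

class Builder:
    """Collects the ops emitted while the decorated function runs."""
    def __init__(self):
        self.ops, self.counter = [], 0

    def emit(self, opname, *args):
        self.counter += 1
        result = Value(self, f"%{self.counter}")
        self.ops.append(
            f"{result.name} = {opname} {', '.join(a.name for a in args)}")
        return result

def kernel(fn):
    """Hypothetical decorator: run the body once to stage it into IR."""
    builder = Builder()
    args = [Value(builder, f"%arg{i}")
            for i in range(fn.__code__.co_argcount)]
    fn(*args)                 # tracing, not execution
    return builder.ops        # a real DSL would compile this to MLIR/PTX

@kernel
def gemm(a, b, c):
    c + a @ b                 # D = A*B + C, recorded rather than computed

print("\n".join(gemm))
# %1 = matmul %arg0, %arg1
# %2 = add %arg2, %1
```

The same tracing trick is what lets a user-facing kernel read like plain NumPy-style Python while the compiler sees a full IR it can lower to tensor-core instructions.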
Cite
Ozen. "NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming." ICML 2024 Workshops: ES-FoMo-II, 2024. https://mlanthology.org/icmlw/2024/ozen2024icmlw-nvdsl/
@inproceedings{ozen2024icmlw-nvdsl,
  title     = {{NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming}},
  author    = {Ozen, Guray},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/ozen2024icmlw-nvdsl/}
}