QiMeng-TensorOp: One-Line Prompt Is Enough for High-Performance Tensor Operator Generation with Hardware Primitives
Abstract
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimizing an implementation takes months and the result lacks portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce QiMeng-TensorOp, a tensor-operator auto-generation framework driven by a one-line user prompt, which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives and to tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to a 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp reaches 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS performance on NVIDIA GPUs. Additionally, QiMeng-TensorOp reduces development costs by 200× compared with human experts.
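To make the parameter-tuning stage the abstract mentions concrete, the sketch below performs a simple grid search over tile sizes for a blocked matrix multiply and keeps the fastest configuration. This is a minimal illustrative sketch, not the paper's method: the names `blocked_gemm`, `tune_gemm`, and `candidate_tiles` are hypothetical, and an actual QiMeng-TensorOp kernel would be emitted in hardware primitives rather than NumPy.

```python
# Hypothetical sketch of operator parameter tuning: grid search over
# tile sizes for a blocked GEMM. Names are illustrative, not from the paper.
import time
import numpy as np

def blocked_gemm(A, B, tile):
    """Tiled matrix multiply; a real generated operator would use
    hardware primitives (e.g., vector/matrix instructions) instead."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def tune_gemm(n=256, candidate_tiles=(16, 32, 64, 128)):
    """Measure each candidate tile size and return the fastest one."""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    best_tile, best_time = None, float("inf")
    for tile in candidate_tiles:
        start = time.perf_counter()
        blocked_gemm(A, B, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time

if __name__ == "__main__":
    tile, t = tune_gemm()
    print(f"best tile size: {tile} ({t:.4f}s)")
```

The key design point this mirrors is that the best tiling is hardware-dependent (cache sizes, vector widths), which is why the framework tunes parameters per platform instead of fixing them at generation time.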
Cite
Text
Zhang et al. "QiMeng-TensorOp: One-Line Prompt Is Enough for High-Performance Tensor Operator Generation with Hardware Primitives." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/783
Markdown
[Zhang et al. "QiMeng-TensorOp: One-Line Prompt Is Enough for High-Performance Tensor Operator Generation with Hardware Primitives." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/zhang2025ijcai-qimeng/) doi:10.24963/IJCAI.2025/783
BibTeX
@inproceedings{zhang2025ijcai-qimeng,
title = {{QiMeng-TensorOp: One-Line Prompt Is Enough for High-Performance Tensor Operator Generation with Hardware Primitives}},
author = {Zhang, Xuzhi and Peng, Shaohui and Zhou, Qirui and Wen, Yuanbo and Guo, Qi and Chen, Ruizhi and Zhu, Xinguo and Xiong, Weiqiang and Chen, Haixin and Ma, Congying and Gao, Ke and Zhao, Chen and Wu, Yanjun and Chen, Yunji and Li, Ling},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {7038--7046},
doi = {10.24963/IJCAI.2025/783},
url = {https://mlanthology.org/ijcai/2025/zhang2025ijcai-qimeng/}
}