LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression
Abstract
We propose using low-rank matrix decomposition (LoRD), which splits a large matrix into a product of two smaller matrices, to compress neural network models and thereby speed up inference. Unlike quantization, LoRD keeps the parameters fully differentiable and trainable and relies on efficient floating-point operations. We investigate its advantages for compressing Large Language Models (LLMs) for monolingual code generation, demonstrating that linear layer ranks can be reduced by up to 39.58% with less than a 1% increase in perplexity. Specifically, we use LoRD to compress the StarCoder 16B model to 13.2B parameters with no performance drop, and to 12.3B parameters with a minimal drop in HumanEval Pass@1, all within 10 minutes on a single A100 GPU. The compressed models achieve up to a 22.35% inference speedup with just a single line of code change in HuggingFace's implementation with the PyTorch backend.
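As a rough illustration of the decomposition itself, the sketch below factorizes the weight of a PyTorch nn.Linear layer with a truncated SVD into a product of two smaller linear layers. This is not the paper's exact procedure; the function name, rank choice, and layer sizes are illustrative assumptions.

# Minimal sketch of low-rank decomposition of one linear layer, using a plain
# truncated SVD; LoRD's actual decomposition and rank-selection procedure may
# differ. All names and sizes below are illustrative.
import torch
import torch.nn as nn

def lowrank_decompose_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Replace y = W x + b with y = B (A x) + b, where A has shape (rank, in)
    # and B has shape (out, rank), taken from a truncated SVD of W.
    W = layer.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]                 # absorb singular values into the left factor
    A = Vh[:rank, :]

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)        # both factors stay trainable floats

# At rank r the two factors hold r * (in + out) parameters versus in * out for
# the dense layer, so a 4096x4096 layer at rank 1024 needs about half the weights.
layer = nn.Linear(4096, 4096)
compressed = lowrank_decompose_linear(layer, rank=1024)
W_approx = compressed[1].weight @ compressed[0].weight
print(((layer.weight - W_approx).norm() / layer.weight.norm()).item())  # relative approximation error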
Cite
Text
Kaushal et al. "LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression." ICML 2024 Workshops: FM-Wild, 2024.
Markdown
[Kaushal et al. "LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/kaushal2024icmlw-lord/)
BibTeX
@inproceedings{kaushal2024icmlw-lord,
title = {{LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression}},
author = {Kaushal, Ayush and Vaidhya, Tejas and Rish, Irina},
booktitle = {ICML 2024 Workshops: FM-Wild},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/kaushal2024icmlw-lord/}
}