Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?
Abstract
Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating the handling of multiple languages across a variety of tasks. However, there is no single metric that can predict an LLM’s multilingual capabilities. To address this gap, we propose Compression Parity (CP) – a metric based on Shannon’s information measure – to assess the multilingual capabilities of an LLM in a task-agnostic manner. We evaluate CP on open-source LLMs (Llama2, Gemma, Mistral) and demonstrate a strong correlation with existing task-specific metrics from the literature – stronger than any of the existing metrics we are aware of, e.g., tokenizer parity and fertility. These findings show that CP is a good predictor of an LLM’s performance in a given language; hence it may serve as a useful tool for ranking multilingual LLMs’ capabilities regardless of the downstream task.
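The abstract defines Compression Parity via Shannon’s information measure: the bits an LLM assigns to a text reflect how efficiently it compresses that language. The exact formula is not given on this page, so the sketch below is only an illustrative assumption: it treats CP as the ratio of the total self-information (in bits) a model assigns to parallel texts in English versus another language, with the function and variable names invented for the example.

```python
import math

def sequence_bits(token_probs):
    """Total Shannon information (in bits) a model assigns to a sequence,
    given the probability it predicted for each token."""
    return -sum(math.log2(p) for p in token_probs)

def compression_parity(probs_lang, probs_english):
    """Hypothetical Compression Parity sketch: ratio of the bits an LLM
    needs to encode an English text vs. its parallel translation.
    Values near 1 suggest the model compresses both languages equally well."""
    return sequence_bits(probs_english) / sequence_bits(probs_lang)

# Toy per-token probabilities for a parallel sentence pair:
# the model is more confident (fewer bits) on the English side.
en_probs = [0.5, 0.25, 0.5]    # 1 + 2 + 1 = 4 bits
xx_probs = [0.25, 0.25, 0.25]  # 2 + 2 + 2 = 6 bits
print(round(compression_parity(xx_probs, en_probs), 3))  # → 0.667
```

In practice the per-token probabilities would come from the LLM’s next-token distribution over a parallel corpus; this toy version just makes the information-theoretic ratio concrete.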
Cite
Text
Tsvetkov and Kipnis. "Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?" ICML 2024 Workshops: TF2M, 2024.
Markdown
[Tsvetkov and Kipnis. "Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?" ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/tsvetkov2024icmlw-multilingual/)
BibTeX
@inproceedings{tsvetkov2024icmlw-multilingual,
title = {{Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?}},
author = {Tsvetkov, Alexander and Kipnis, Alon},
booktitle = {ICML 2024 Workshops: TF2M},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/tsvetkov2024icmlw-multilingual/}
}