Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?

Abstract

Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating the handling of multiple languages across a variety of tasks. However, no single metric can predict an LLM's multilingual capabilities. To address this gap, we propose Compression Parity (CP), a metric based on Shannon's information measure, to assess the multilingual capabilities of an LLM in a task-agnostic manner. We evaluate CP on open-source LLMs (Llama2, Gemma, Mistral) and demonstrate a strong correlation with existing task-specific metrics from the literature, stronger than that of any existing metric we are aware of, e.g., tokenizer parity and fertility. These findings show that CP is a good predictor of an LLM's performance in a given language, and hence it may serve as a useful tool for ranking the capabilities of multilingual LLMs regardless of the downstream task.
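
As a rough illustration of the idea (a minimal sketch, not the authors' exact formulation), the snippet below assumes CP is computed from the number of bits an LLM assigns to parallel translations of the same content, normalized against a reference language (English here). The model name, example sentences, and the orientation of the ratio are assumptions made for illustration only.

# Hypothetical sketch of a compression-parity-style ratio (assumption, not the paper's exact definition).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def total_bits(model, tokenizer, text: str) -> float:
    """Total code length, in bits, that the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood (nats) per predicted token;
    # multiply by the number of predicted tokens and convert nats to bits.
    n_predicted = ids.shape[1] - 1
    return out.loss.item() * n_predicted / math.log(2)

def compression_parity(model, tokenizer, text_ref: str, text_tgt: str) -> float:
    """Bits needed for the reference-language text divided by bits needed for
    its target-language translation (1.0 = equally efficient encoding)."""
    return total_bits(model, tokenizer, text_ref) / total_bits(model, tokenizer, text_tgt)

if __name__ == "__main__":
    name = "mistralai/Mistral-7B-v0.1"  # placeholder model choice
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name)
    en = "The cat sat on the mat."
    de = "Die Katze saß auf der Matte."  # parallel translation
    print(f"CP (en vs. de): {compression_parity(lm, tok, en, de):.3f}")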

Cite

Text

Tsvetkov and Kipnis. "Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?" ICML 2024 Workshops: TF2M, 2024.

Markdown

[Tsvetkov and Kipnis. "Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?" ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/tsvetkov2024icmlw-multilingual/)

BibTeX

@inproceedings{tsvetkov2024icmlw-multilingual,
  title     = {{Multilingual Compression Parity: How Efficiently Large Language Models Represent Information Across Languages?}},
  author    = {Tsvetkov, Alexander and Kipnis, Alon},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/tsvetkov2024icmlw-multilingual/}
}