BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Myung, Junho; Lee, Nayeon; Zhou, Yi; Jin, Jiho; Putri, Rifki Afina; Antypas, Dimosthenis; Borkakoty, Hsuvas; Kim, Eunsu; Perez-Almendros, Carla; Ayele, Abinew Ali; Gutiérrez-Basulto, Víctor; Ibáñez-García, Yazmín; Lee, Hwaran; Muhammad, Shamsuddeen Hassan; Park, Kiwoong; Rzayev, Anar Sabuhi; White, Nina; Yimam, Seid Muhie; Pilehvar, Mohammad Taher; Ousidhoum, Nedjma; Camacho-Collados, Jose; Oh, Alice

doi:10.52202/079017-2483

NeurIPS 2024

Abstract

Large language models (LLMs) often lack culture-specific everyday knowledge, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are usually limited to a single language or to online sources such as Wikipedia, which may not reflect the daily habits, customs, and lifestyles of different regions. That is, information about the food people eat for their birthday celebrations, the spices they typically use, the musical instruments youngsters play, or the sports they practice in school is not always explicitly written online. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. The benchmark comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We evaluate LLMs in two formats: short-answer questions and multiple-choice questions. We show that LLMs perform better for cultures that are more present online, with a maximum difference of 57.34% for GPT-4, the best-performing model, in the short-answer format. Furthermore, we find that LLMs perform better in the local language for mid-to-high-resource languages. Interestingly, for languages deemed to be low-resource, LLMs provide better answers in English. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.
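As an illustration of the cross-cultural gap the abstract reports, the following is a minimal sketch of how a maximum per-country difference could be computed from graded answers. It assumes the gap is the difference between the highest and lowest per-country accuracy; the abstract does not spell out this definition. The function names, the record layout, and the toy data are hypothetical and are not taken from the BLEnD repository.

```python
from collections import defaultdict


def accuracy_by_country(records):
    """Return {country: accuracy in percent} from (country, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for country, is_correct in records:
        totals[country] += 1
        correct[country] += int(is_correct)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def max_gap(scores):
    """Difference between the best and worst per-country score (assumed definition)."""
    return max(scores.values()) - min(scores.values())


# Hypothetical toy data: placeholder country labels, not results from the paper.
toy = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
scores = accuracy_by_country(toy)
gap = max_gap(scores)
print(scores, gap)
```

The same aggregation would apply to either evaluation format, provided each answer has already been graded as correct or incorrect.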

Cite

Text

Myung et al. "BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages." Neural Information Processing Systems, 2024. doi:10.52202/079017-2483

Markdown

[Myung et al. "BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/myung2024neurips-blend/) doi:10.52202/079017-2483

BibTeX

@inproceedings{myung2024neurips-blend,
  title     = {{BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages}},
  author    = {Myung, Junho and Lee, Nayeon and Zhou, Yi and Jin, Jiho and Putri, Rifki Afina and Antypas, Dimosthenis and Borkakoty, Hsuvas and Kim, Eunsu and Perez-Almendros, Carla and Ayele, Abinew Ali and Gutiérrez-Basulto, Víctor and Ibáñez-García, Yazmín and Lee, Hwaran and Muhammad, Shamsuddeen Hassan and Park, Kiwoong and Rzayev, Anar Sabuhi and White, Nina and Yimam, Seid Muhie and Pilehvar, Mohammad Taher and Ousidhoum, Nedjma and Camacho-Collados, Jose and Oh, Alice},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2483},
  url       = {https://mlanthology.org/neurips/2024/myung2024neurips-blend/}
}