CVQA: Culturally-Diverse Multilingual Visual Question Answering Benchmark
Abstract
Visual Question Answering (VQA) is an important task in multimodal AI, which requires models to understand and reason over knowledge present in visual and textual data. However, most current VQA datasets and models focus primarily on English and a few major world languages, with images that are Western-centric. While recent efforts have tried to increase the number of languages covered in VQA datasets, they still lack diversity in low-resource languages. More importantly, some datasets extend the text to other languages, either via translation or other approaches, but usually keep the same images, resulting in narrow cultural representation. To address these limitations, we create CVQA, a new Culturally-diverse Multilingual Visual Question Answering benchmark dataset, designed to cover a rich set of languages and regions, where we engage native speakers and cultural experts in the data collection process. CVQA includes culturally-driven images and questions from 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for current state-of-the-art models. This benchmark will serve as a probing evaluation suite for assessing the cultural bias of multimodal models and will hopefully encourage more research efforts towards increasing cultural awareness and linguistic diversity in this field.
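As a rough illustration of how a multiple-choice VQA benchmark like this is typically consumed, the sketch below scores a model on a CVQA-style split. The dataset identifier, the field names (image, question, options, label), and the model.predict interface are assumptions made for illustration, not the official CVQA loader or schema; consult the dataset release for the actual layout.

# Hypothetical sketch: scoring a multiple-choice VQA model on a CVQA-style split.
# The dataset ID and column names below are assumptions; check the official
# CVQA release for the real schema before using this.
from datasets import load_dataset

def evaluate(model, dataset_id="afaji/cvqa", split="test"):
    ds = load_dataset(dataset_id, split=split)  # assumed Hugging Face dataset ID
    correct = 0
    for example in ds:
        # `model.predict` is a placeholder: it should return the index of the
        # option the model selects for this image-question pair.
        pred = model.predict(
            image=example["image"],
            question=example["question"],
            options=example["options"],
        )
        correct += int(pred == example["label"])
    return correct / len(ds)

Accuracy over the multiple-choice options is the natural headline metric here, since each question comes with a fixed answer set rather than free-form text.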
Cite
Text
Romero et al. "CVQA: Culturally-Diverse Multilingual Visual Question Answering Benchmark." Neural Information Processing Systems, 2024. doi:10.52202/079017-0366

Markdown

[Romero et al. "CVQA: Culturally-Diverse Multilingual Visual Question Answering Benchmark." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/romero2024neurips-cvqa/) doi:10.52202/079017-0366

BibTeX
@inproceedings{romero2024neurips-cvqa,
title = {{CVQA: Culturally-Diverse Multilingual Visual Question Answering Benchmark}},
author = {Romero, David and Lyu, Chenyang and Wibowo, Haryo Akbarianto and Lynn, Teresa and Hamed, Injy and Kishore, Aditya Nanda and Mandal, Aishik and Dragonetti, Alina and Abzaliev, Artem and Tonja, Atnafu Lambebo and Balcha, Bontu Fufa and Whitehouse, Chenxi and Salamea, Christian and Velasco, Dan John and Adelani, David Ifeoluwa and Le Meur, David and Villa-Cueva, Emilio and Koto, Fajri and Farooqui, Fauzan and Belcavello, Frederico and Batnasan, Ganzorig and Vallejo, Gisela and Caulfield, Grainne and Ivetta, Guido and Song, Haiyue and Ademtew, Henok Biadglign and Maina, Hernán and Lovenia, Holy and Azime, Israel Abebe and Cruz, Jan Christian Blaise and Gala, Jay and Geng, Jiahui and Ortiz-Barajas, Jesus-German and Baek, Jinheon and Dunstan, Jocelyn and Alemany, Laura Alonso and Nagasinghe, Kumaranage Ravindu Yasas and Benotti, Luciana and D'Haro, Luis Fernando and Viridiano, Marcelo and Estecha-Garitagoitia, Marcos and Cabrera, Maria Camila Buitrago and Rodríguez-Cantelar, Mario and Jouitteau, Mélanie and Mihaylov, Mihail and Etori, Naome and Imam, Mohamed Fazli Mohamed and Adilazuarda, Muhammad Farid and Gochoo, Munkhjargal and Otgonbold, Munkh-Erdene and Niyomugisha, Olivier and Silva, Paula Mónica and Chitale, Pranjal and Dabre, Raj and Chevi, Rendi and Zhang, Ruochen and Diandaru, Ryandito and Cahyawijaya, Samuel and Góngora, Santiago and Jeong, Soyeong and Purkayastha, Sukannya and Kuribayashi, Tatsuki and Clifford, Teresa and Jayakumar, Thanmay and Torrent, Tiago Timponi and Ehsan, Toqeer and Araujo, Vladimir and Kementchedjhieva, Yova and Burzo, Zara and Lim, Zheng Wei and Yong, Zheng Xin and Ignat, Oana and Nwatu, Joan and Mihalcea, Rada and Solorio, Thamar and Aji, Alham Fikri},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0366},
url = {https://mlanthology.org/neurips/2024/romero2024neurips-cvqa/}
}