Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
Abstract
We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DALLE STREET, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CULTUREADAPT. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DALLE STREET and other existing benchmarks, which we try to understand using over 18000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.
Cite
Text
Mukherjee et al. "Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Mukherjee et al. "Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/mukherjee2025wacv-crossroads/)
BibTeX
@inproceedings{mukherjee2025wacv-crossroads,
title = {{Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models}},
author = {Mukherjee, Anjishnu and Zhu, Ziwei and Anastasopoulos, Antonios},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {1755--1764},
url = {https://mlanthology.org/wacv/2025/mukherjee2025wacv-crossroads/}
}