Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models

Abstract

We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs), by introducing DALLE STREET, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations, with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations, using a modular pipeline, CULTUREADAPT. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DALLE STREET and other existing benchmarks, which we try to understand using over 18000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.

Cite

Text

Mukherjee et al. "Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models." Winter Conference on Applications of Computer Vision, 2025.

BibTeX

@inproceedings{mukherjee2025wacv-crossroads,
  title     = {{Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models}},
  author    = {Mukherjee, Anjishnu and Zhu, Ziwei and Anastasopoulos, Antonios},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {1755--1764},
  url       = {https://mlanthology.org/wacv/2025/mukherjee2025wacv-crossroads/}
}