Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies
Abstract
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. For example, suggesting clocks as gifts for a baby’s birthday in China may invoke associations with death, leading to user discomfort and undermining trust. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains (i.e., shopping, meal planning, and outdoor activities), and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models (e.g., Llama-4-Maverick) and reasoning models (e.g., o1 and Gemini-2.5-Pro). Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models achieve performance better or comparable to GPT-4o, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o’s cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks. This work establishes a framework for evaluating and improving cultural safety in vision-language systems across diverse global contexts.
Cite
Text
Qiu et al. "Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies." Transactions on Machine Learning Research, 2025.Markdown
[Qiu et al. "Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/qiu2025tmlr-multimodal/)BibTeX
@article{qiu2025tmlr-multimodal,
title = {{Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies}},
author = {Qiu, Haoyi and Huang, Kung-Hsiang and Zheng, Ruichen and Sun, Jiao and Peng, Nanyun},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/qiu2025tmlr-multimodal/}
}