Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less-documented cultures within LLMs' representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs' cultural competence, without accounting for how LLMs' internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose CultureScope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract cultural knowledge. We introduce a cultural flattening score as a measure of intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding. Our code and data used for experiments are publicly available.
Khan et al. (2025) found that if MCQs lack the adversarial depth to probe genuine cultural understanding, models can exploit surface-level elimination strategies without truly understanding cultural distinctions. Thus, we propose a cultural MCQ task with hard negatives to study how overgeneralization, driven by regional or resource dominance or similarity, affects downstream task performance. Since BLEnD (Myung et al., 2024) provides culture-specific answers to the same question, we create BLEnD-resource and BLEnD-region partitions using the culturally nuanced answers in BLEnD, as sketched below.
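The sketch below illustrates one way such hard-negative MCQs could be assembled from BLEnD-style records (one question with culture-specific gold answers). The grouping dictionaries, field names, and helper function are illustrative assumptions, not the released BLEnD schema or the paper's exact construction procedure.

```python
# Minimal sketch: build a cultural MCQ whose distractors ("hard negatives")
# come from cultures in the same region or at a similar resource level.
# REGION / RESOURCE groupings below are hypothetical examples.
import random

REGION = {"Ethiopia": "Africa", "Algeria": "Africa", "Iran": "Middle East"}
RESOURCE = {"Ethiopia": "low", "Algeria": "low", "Iran": "high"}

def build_mcq(question, answers_by_culture, target, group, n_options=4, seed=0):
    """Build one MCQ whose distractors come from cultures sharing the target's group."""
    rng = random.Random(seed)
    gold = answers_by_culture[target]
    # Hard negatives: answers from other cultures in the same group as the target.
    negatives = [a for c, a in answers_by_culture.items()
                 if c != target and group.get(c) == group.get(target) and a != gold]
    rng.shuffle(negatives)
    options = [gold] + negatives[: n_options - 1]
    rng.shuffle(options)
    return {"question": question.format(culture=target),
            "options": options,
            "answer_idx": options.index(gold)}

mcq = build_mcq(
    "What is a common breakfast food in {culture}?",
    {"Ethiopia": "genfo", "Algeria": "msemen", "Iran": "halim"},
    target="Ethiopia",
    group=REGION,  # swap in RESOURCE for the BLEnD-resource partition
)
print(mcq)
```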
With our CultureScope method, we can now probe the cultural knowledge encoded within the internal representations of LLMs. As shown in Figure 3, which visualizes the cultural flattening direction between cultures, we find unidirectional connections in which Iran and the United States serve as the target cultures. These unidirectional connections imply that models may have learned to represent less-documented cultures, such as Ethiopia and Algeria, through these higher-resource cultures. Our experiments using hard negative options align with prior work finding that LLMs sometimes respond with answers aligned with culturally similar or geographically proximate regions (Cao et al., 2023; Tao et al., 2024). We further attribute the models' tendency to favor culturally adjacent answers to the unidirectional connections uncovered by the proposed CultureScope. These findings underscore the need for methods that can disentangle culturally entangled representations, particularly among similar cultures, to enhance the accuracy and cultural appropriateness of LLM outputs.
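To make the patching-based probe concrete, the sketch below shows a generic activation-patching setup in the spirit described above: cache a hidden state from a source-culture prompt, patch it into the same layer for a target-culture prompt, and check whether the prediction shifts toward the source culture (repeating in both directions would expose the asymmetry behind the unidirectional connections). The model, layer index, prompts, and readout are simplifying assumptions for illustration, not the paper's exact CultureScope procedure.

```python
# Hedged sketch of activation patching with a placeholder model (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's target LLMs would go here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer = model.transformer.h[6]  # arbitrary mid layer; GPT-2 naming, adjust per model

def last_hidden(prompt):
    """Cache the layer's hidden state at the final token of `prompt`."""
    cache = {}
    def hook(_, __, out):
        cache["h"] = out[0][:, -1, :].detach()
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def patched_logits(prompt, h_src):
    """Run `prompt` with the final-token hidden state replaced by `h_src`."""
    def hook(_, __, out):
        out[0][:, -1, :] = h_src
        return out
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    handle.remove()
    return logits

# Does patching United States activations into an Ethiopia prompt move the
# prediction more than the reverse? A one-way effect would mirror the
# unidirectional connections discussed above.
h_us = last_hidden("A common breakfast food in the United States is")
logits = patched_logits("A common breakfast food in Ethiopia is", h_us)
print(tok.decode(logits.argmax().item()))
```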