ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context

Paper · arXiv 2407.06866 · Published July 9, 2024

While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood that an LLM refuses to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to those caused by direct statements of political ideology.

We instead focus on a previously unexplored factor in unequal capabilities: chatbot guardrails, the restrictions that limit model responses to uncertain or sensitive questions and often provide boilerplate text refusing to fulfill a request (see Fig. 1).