Can LLMs Follow Simple Rules?

Paper · arXiv 2311.04235 · Published November 6, 2023
Tags: Reasoning · Logic · Internal Rules

https://arxiv.org/abs/2311.04235

Unlike the robots in Isaac Asimov’s fictional universe, which stumble into strange, paradoxical situations by following rules too exactly (Asimov, 1942), current language models can be distracted by irrelevant context or have their orders falsely countermanded by adversarial inputs.

...it is important for model behavior to be truly grounded in the provided rules rather than relying on spurious cues and merely appearing to do so. These properties will be important components in creating safe and trustworthy AI products.

Many rules we might want to impose on the behavior of language model assistants are simple in concept and easily expressed in natural language. For instance, a shoe store using a language model to answer customer support queries may want the model to decline questions unrelated to its products. One way of applying such rules is to include them as part of the model’s prompt and leverage the model’s instruction-following capabilities to condition its subsequent outputs. Some models support system messages intended for this purpose, though we find these offer only marginal improvements in reliability.
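The prompt-based approach described above can be sketched as follows. This is a minimal illustration, assuming a chat-completion-style message format; the shoe-store rule text and the `build_messages` helper are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: embedding a natural-language rule in a chat prompt.
# The system/user message structure mirrors common chat-completion APIs;
# the rule wording below is an illustrative assumption.

RULE = (
    "You are a customer support assistant for a shoe store. "
    "Politely decline any question unrelated to our shoe products."
)

def build_messages(user_query: str) -> list[dict]:
    """Prepend the rule as a system message so the model conditions on it."""
    return [
        {"role": "system", "content": RULE},
        {"role": "user", "content": user_query},
    ]

messages = build_messages("What's your return policy for sneakers?")
```

The rule is stated once in the system message and every subsequent user turn is appended after it, so the model's outputs are (ideally) conditioned on the rule throughout the conversation.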

We introduce a new benchmark, Rule-following Language Evaluation Scenarios (RULES), to automatically evaluate how well a language model follows a variety of rules. In each scenario, the model is instructed to engage in a role-playing activity while adhering to a set of rules. Analogous to Anthropic’s Helpful and Harmless framework (Bai et al., 2022b), our rules require the model to either refrain from or engage in certain behaviors, and can also be thought of as safety and liveness properties in computer systems (Lamport, 1977).
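The safety/liveness distinction can be made concrete with a toy checker. This is a hypothetical sketch, not the paper's actual evaluation harness: it assumes one "must not" (safety) rule, never reveal a secret key, and one "must" (liveness) rule, greet any user who says the password. The secret, password, and function names are all illustrative.

```python
# Hypothetical RULES-style checks for two assumed rules:
#   safety  ("must not"): the model never reveals SECRET in its response.
#   liveness ("must"):    the model greets a user who says PASSWORD.
# Values and logic are illustrative assumptions, not from the paper.

SECRET = "opensesame"
PASSWORD = "mellon"

def violates_safety(response: str) -> bool:
    """A safety rule is broken by any single bad output: leaking the secret."""
    return SECRET in response

def violates_liveness(user_msg: str, response: str) -> bool:
    """A liveness rule is broken by failing to act: no greeting after the password."""
    return PASSWORD in user_msg and "hello" not in response.lower()

def follows_rules(user_msg: str, response: str) -> bool:
    """The model passes a test case only if neither rule is violated."""
    return not violates_safety(response) and not violates_liveness(user_msg, response)
```

Note the asymmetry: a safety violation is detectable from a single response, while a liveness violation is about something the model failed to do, which is why the two kinds of rules call for different test cases.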

We find that the vast majority of models fail to follow the rules on a significant fraction of test cases.