Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Recent studies have found these models to be surprisingly strong at performing deductive reasoning over formal logical theories expressed in natural language. A shortcoming of these studies, however, is that they do not take into account that logical theories, when sampled uniformly at random, do not necessarily lead to hard instances. We propose a new methodology for creating challenging algorithmic reasoning datasets that focus on natural language satisfiability (NLSat) problems. The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language. This methodology allows us to distinguish easy from hard instances, and to systematically increase the complexity of existing reasoning benchmarks such as RuleTaker. We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems of substantially increased difficulty.
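To make the sampling idea concrete, the sketch below (illustrative only, not the paper's actual generator) samples random 3-SAT formulas at several clause-to-variable ratios and brute-forces satisfiability. Empirically, random 3-SAT is hardest near the well-known phase-transition ratio of roughly 4.26; formulas sampled far from it tend to be trivially satisfiable or unsatisfiable, which is why naive uniform sampling tends to yield easy instances.

```python
import random
from itertools import product

def random_ksat(num_vars, num_clauses, k=3, rng=random):
    """Sample a uniformly random k-SAT formula: each clause draws k
    distinct variables and negates each independently with prob. 1/2."""
    clauses = []
    for _ in range(num_clauses):
        chosen = rng.sample(range(1, num_vars + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses

def is_satisfiable(clauses, num_vars):
    """Brute-force model search (fine at this toy scale)."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

n = 12
for ratio in (2.0, 4.26, 6.0):  # 4.26 is near the 3-SAT phase transition
    m = round(ratio * n)
    sat = sum(is_satisfiable(random_ksat(n, m), n) for _ in range(20))
    print(f"m/n = {ratio}: {sat}/20 sampled formulas satisfiable")
```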
Specifically, given a set of systematically constructed natural language theories consisting of a set of explicitly stated rules and facts (e.g., the NL Theory in the bottom part of Figure 1 containing fictional rules about characters Bob and Alan), the goal is to determine whether a model can learn to perform deductive reasoning over such theories by correctly answering queries that require making novel inferences (e.g., predicting that Alan is green holds based on knowing that Alan is rough and applying the rule All rough people are green).
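As a toy illustration of this kind of rule application (the fact and rule encoding here is hypothetical, not the dataset's actual format), forward chaining to a fixpoint derives every fact entailed by a small Figure 1-style theory:

```python
# Facts are (entity, property) atoms; rules are (premise, conclusion)
# property pairs, read as "all <premise> people are <conclusion>".
facts = {("Alan", "rough"), ("Bob", "big")}
rules = [("rough", "green"), ("green", "kind")]

def forward_chain(facts, rules):
    """Apply rules until no new fact is derivable (a fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for entity, prop in list(derived):
                if prop == premise and (entity, conclusion) not in derived:
                    derived.add((entity, conclusion))
                    changed = True
    return derived

closure = forward_chain(facts, rules)
print(("Alan", "green") in closure)  # True: rough -> green applies to Alan
```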
To establish that a model has genuinely learned to reason, it must be demonstrated that the model can solve the hardest instances of the target problem.
Answering the hardness question, therefore, involves two additional questions: (Q1) Is the formal language used to express the target problem space capable of expressing hard problems (e.g., ones that go beyond simple linear chaining)? (Q2) Is the sampling method used to generate target instances able to effectively capture the full problem space?
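Q1 can be made concrete with a small sketch. If every theory reduces to a linear chain of ground implications, then entailment is just graph reachability, solvable in linear time, so such instances can never be hard. The encoding below is illustrative and assumes atoms are plain strings:

```python
from collections import deque

def entails(implications, known, query):
    """BFS from the known facts through implication edges p -> q."""
    graph = {}
    for p, q in implications:
        graph.setdefault(p, []).append(q)
    seen, frontier = set(known), deque(known)
    while frontier:
        atom = frontier.popleft()
        if atom == query:
            return True
        for nxt in graph.get(atom, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

chain = [("rough", "green"), ("green", "kind"), ("kind", "smart")]
print(entails(chain, {"rough"}, "smart"))  # True, found by simple chaining
```

Full NLSat, by contrast, inherits NP-hardness from propositional satisfiability, so a benchmark whose instances are all chain-like in this sense cannot certify robust reasoning.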