Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up

Paper · arXiv 2410.12323 · Published October 16, 2024
Reasoning Architectures · Novel Architectures · Reasoning by Reflection

We propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT employs a Preference-Guided Reverse Reasoning warm-up strategy that integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation, generating task-specific prompts solely from demonstrations and aligning with the cognitive preferences LLMs acquire through Reinforcement Learning with Human Feedback (RLHF). Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs' reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.
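To make the described pipeline concrete, the following is a minimal, self-contained sketch of how the reverse reasoning warm-up, pairwise preference self-evaluation, and Cognitive Preference Manager might fit together. All function names, prompt wording, and the Jaccard-similarity boundary check are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an RoT-style pipeline (assumptions throughout;
# this is not the paper's released code).
from typing import Callable

LLM = Callable[[str], str]  # any single-turn LLM completion call

def reverse_reason(llm: LLM, demos: list[str]) -> str:
    """Warm-up: infer the task definition as logical pseudocode
    purely from input-output demonstrations (reverse prompting)."""
    return llm(
        "Given only these demonstrations, reconstruct the task "
        "definition as logical pseudocode:\n\n" + "\n\n".join(demos)
    )

def prefer(llm: LLM, a: str, b: str) -> str:
    """Pairwise preference self-evaluation: keep whichever task
    description the model itself rates higher."""
    verdict = llm(f"Which description is better?\nA: {a}\nB: {b}\nAnswer A or B.")
    return a if verdict.strip().upper().startswith("A") else b

def jaccard(a: str, b: str) -> float:
    """Crude token-overlap stand-in for a knowledge-boundary check."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def rot_prompt(llm: LLM, original_task: str, demos: list[str],
               n_candidates: int = 2, threshold: float = 0.5) -> str:
    """Reverse-reason several candidates, reduce them by pairwise
    preference, then route via a Cognitive-Preference-Manager-like rule."""
    candidates = [reverse_reason(llm, demos) for _ in range(n_candidates)]
    best = candidates[0]
    for cand in candidates[1:]:
        best = prefer(llm, best, cand)
    if jaccard(original_task, best) >= threshold:  # "known" task:
        return f"{original_task}\n\nSolution logic:\n{best}"  # aggregate logic
    return best  # "unknown" task: fall back to the model-preferred template
```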

Recent studies have advanced CoT to guide LLMs, mainly through multi-step prompting such as plan-and-solve (Plaat et al., 2024; Yang et al., 2024a), self-consistency (Narang et al.), and recursive reasoning processes (Lee and Kim, 2023; Yu et al., 2024) via Tree-of-Thought (ToT) (Yao et al., 2024) and Graph-of-Thought (GoT) (Besta et al., 2024), or through multi-role prompting (Zhang et al.; Suzgun and Kalai, 2024), to enhance logical capabilities and mitigate hallucination. However, these approaches quietly increase the computational cost of inference or API access because of their multi-step inference. Buffer-of-Thought (BoT) (Yang et al., 2024a) attempts to reduce thinking steps by leveraging Retrieval-Augmented Generation (RAG) to retrieve gold thought templates from a buffer, but it sacrifices flexibility because the buffer is initialized with pre-set manual thought templates. Therefore, achieving accurate reasoning in LLMs while minimizing resource consumption remains a significant challenge.

In summary, existing methods primarily rely on multi-query CoT, which injects knowledge (Suzgun and Kalai, 2024; Plaat et al., 2024) or data structures (Yao et al., 2024; Besta et al., 2024) to optimize decision making, and they encounter three significant limitations: (1) Limited logical reasoning: despite experiments with different logical data structures (Yao et al., 2024; Besta et al., 2024; Yang et al., 2024a), an effective, proactive Chain-of-Thought paradigm that suits and improves logical reasoning remains elusive (Bao et al., 2024). (2) Unfaithfulness and cascading errors: single-step or multi-step methods are liable to make LLMs output hallucinations, leading to cascading logic errors (Bao et al., 2024). (3) Trade-off between enhanced logical capability and resource consumption: recent CoT advancements via multi-step or multi-role prompting increase costs, so striking a balance among logical flexibility, accuracy, and cost is of great significance for practical applications.

To address the above limitations, we propose Reversal of Thought (RoT), a novel framework that enables LLMs to explore their cognitive preferences over logical pseudocode using only reverse prompting on given demonstrations, without additional task-related instructions, as depicted in Figure 1. Our key contributions are as follows:

• To the best of our knowledge, we are the first to introduce reverse reasoning over cognitive preferences to enhance logical reasoning in LLMs, combining meta-cognition with cognitive preference to yield a more cost-efficient and error-resilient framework for complex tasks.

• We propose a Preference-Guided Reverse Reasoning framework that strengthens LLMs' task cognition by employing a reverse reasoning warm-up strategy and preference-based self-evaluation, improving logical reasoning in line with LLMs' cognitive preferences.

• We introduce logical symbols in pseudocode for algorithmic planning and problem-solving, guided by meta-cognitive mechanisms, which improves LLMs' capacity for structured and accurate logical reasoning, as sketched below.
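As an illustration of the last bullet, the sketch below shows what a logical-pseudocode planning prompt could look like: the model is asked to express its plan with formal connectives before executing it. The exact wording and symbol set are hypothetical, not the paper's released template.

```python
# Hypothetical planning prompt with logical symbols (assumed wording).
PSEUDOCODE_PLAN_PROMPT = """\
Task: {task}

Before answering, write a plan as logical pseudocode using the symbols
∀ (for all), ∃ (exists), ∧ (and), ∨ (or), ¬ (not), → (implies). Example shape:

plan:
    ∀ x ∈ inputs: define predicate P(x)
    IF P(x) ∧ Q(x) → apply rule R(x)
    RETURN the conclusion derived from the plan

Now execute the plan step by step and give the final answer.
"""

def plan_then_solve(llm, task: str) -> str:
    """Single call: the model plans in logical pseudocode, then solves."""
    return llm(PSEUDOCODE_PLAN_PROMPT.format(task=task))
```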

Chain-of-Thought (CoT) prompting (Wei et al., 2022) has proven to be a promising approach that incorporates an intermediate logic chain to enhance LLMs' reasoning. Recent studies primarily aim to improve logical accuracy by introducing more external validation, such as self-consistency (Narang et al.; Yu et al., 2024), or more hierarchical information, such as Least-to-Most (Zhou et al.), Cumulative Reasoning (Zhang et al.), and multi-expert (Suzgun and Kalai, 2024) strategies, but they face challenges of cumulative errors or poor flexibility. Additionally, numerous studies have proposed more standardized recursive or backtracking branch structures drawn from logical data structures, including Tree-of-Thought (ToT) (Yao et al., 2024), Graph-of-Thought (GoT) (Besta et al., 2024), and Buffer-of-Thought (BoT) (Yang et al., 2024a). However, an efficient logical reasoning method that strikes a balance among logical accuracy, flexibility, and cost has yet to be found.
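For reference, here is a minimal sketch of the self-consistency idea mentioned above: sample several reasoning paths at nonzero temperature and majority-vote the final answers. `sample` stands for an assumed stochastic LLM call, and the answer extraction is deliberately naive.

```python
# Minimal self-consistency sketch (simplified; answer parsing is naive).
from collections import Counter

def self_consistency(sample, question: str, k: int = 5) -> str:
    """Sample k chain-of-thought completions and majority-vote the answers."""
    answers = []
    for _ in range(k):
        completion = sample(f"{question}\nLet's think step by step.")
        # Naive extraction: treat the last non-empty line as the final answer.
        lines = [ln for ln in completion.splitlines() if ln.strip()]
        answers.append(lines[-1].strip() if lines else "")
    return Counter(answers).most_common(1)[0][0]
```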

Integrating knowledge boundaries into LLMs has emerged as a promising strategy for helping them avoid reasoning hallucinations about unknown knowledge, but existing approaches impose knowledge-boundary constraints that require additional algorithmic effort (Yin et al., 2024), external graph knowledge (Tian et al., 2024), or training cost (Sun et al., 2024). Moreover, they focus on declining to respond to unknown or incorrect prompts rather than proactively expanding the knowledge boundary in a heuristic, training-free manner. Inspired by metacognition (Fleur et al., 2021) and cognitive preference for unknown knowledge (Uddin, 2021), we propose a prompt-based method that exploits LLMs' pretrained knowledge boundaries: it conducts reverse prompting to probe knowledge through demonstrations, obtains problem cognitions in the LLM's own "taste", and then aggregates and distills the original prompt into a cognitive-preference-aligned version.

Limitations: Reversal of Thought (RoT) introduces a reverse reasoning warm-up that activates LLMs' cognitive preferences to enhance logical capabilities, along with a cognitive preference manager that determines knowledge boundaries and exploits cognitive preferences for known and unknown tasks. While RoT has performed exceptionally well in logical accuracy and efficiency, a major challenge lies in its reliance on two-shot demonstration inputs involving two distinct problem cases. We observed that RoT may struggle with one-shot learning in multi-source tasks. We partially yet effectively mitigate this issue through the integration of the Cognitive Preference Manager (CPM) and two-shot learning. In future work, we aim to extend RoT's capabilities by incorporating In-Context Learning (ICL), which will allow greater flexibility in adapting to varied contexts and improve performance on more complex reasoning tasks.