Logical Reasoning in Large Language Models: A Survey

Paper · arXiv 2502.09100 · Published February 13, 2025
Reasoning · Logic · Internal Rules · Reasoning Architectures · Evaluations

With the emergence of advanced reasoning models like OpenAI o3 and DeepSeek-R1, large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, their ability to perform rigorous logical reasoning remains an open question. This survey synthesizes recent advancements in logical reasoning within LLMs, a critical area of AI research. It outlines the scope of logical reasoning in LLMs, its theoretical foundations, and the benchmarks used to evaluate reasoning proficiency. We analyze existing capabilities across different reasoning paradigms (deductive, inductive, abductive, and analogical) and assess strategies to enhance reasoning performance, including data-centric tuning, reinforcement learning, decoding strategies, and neuro-symbolic approaches.

2.2 Types of Logical Reasoning

Logical reasoning can be broadly categorized into four main types, each serving distinct purposes and applications:

Deductive Reasoning. This type of reasoning derives specific conclusions from general principles or premises. It operates under the rule that if all premises are true and the reasoning is valid, the conclusion must also be true. For example, given the premises “All apples are red” and “This fruit is an apple,” one can deduce that “This fruit is red.” Deductive reasoning is fundamental in fields such as mathematics and formal logic, where certainty and rigor are paramount.

Inductive Reasoning. Unlike deductive reasoning, inductive reasoning draws general conclusions based on specific observations or evidence. While the conclusions are often considered probable, they are not guaranteed to be true. For instance, observing that all swans seen so far are white might lead to the inductive conclusion that “All swans are white.” Inductive reasoning is widely used in scientific discovery and data-driven decision-making, where patterns and trends are inferred from empirical data.

Abductive Reasoning. This form of reasoning seeks the most plausible explanation or cause for a set of observations, often in the presence of incomplete information. Abductive reasoning is particularly useful in diagnostic tasks and real-world problem-solving. For example, seeing wet spots on the street might lead one to infer that “It has recently rained.” While abductive conclusions are not certain, they provide a practical basis for hypothesis generation and decision-making under uncertainty.

Analogical Reasoning. Analogical reasoning involves drawing comparisons between similar situations or domains to make inferences or solve problems. By identifying parallels between different scenarios, this type of reasoning enables creative problem-solving and knowledge transfer. For example, understanding that planets orbit the sun in elliptical paths might lead one to analogically reason that other celestial bodies, such as comets, exhibit similar orbital characteristics. Analogical reasoning is particularly valuable in fields like education, design, and innovation.
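To make the distinctions concrete, the following toy Python sketch (our illustration, not drawn from the survey or any cited work) encodes the running examples above as small functions, one per paradigm; the rules, observations, and hypotheses are placeholders.

```python
# Toy illustrations of the four reasoning paradigms described above.

# Deduction: apply a general rule to a specific case.
rules = {"apple": "red"}                      # "All apples are red"
def deduce(kind: str):
    return rules.get(kind)                    # "This fruit is an apple" -> "red"

# Induction: generalize from specific observations (probable, not certain).
observations = ["white", "white", "white"]    # swans seen so far
def induce(obs):
    return "All swans are white" if set(obs) == {"white"} else "No general rule"

# Abduction: pick the hypothesis that best accounts for an observation.
explanations = {"rain": {"wet street"}, "sprinkler": {"wet lawn"}}
def abduce(observed):
    return max(explanations, key=lambda h: len(explanations[h] & observed))

# Analogy: transfer a known relation from a source domain to a target domain.
source_relation = "orbit the sun in elliptical paths"   # known for planets
def analogize(target_domain: str):
    return f"{target_domain} may also {source_relation}"

if __name__ == "__main__":
    print(deduce("apple"))            # red
    print(induce(observations))       # All swans are white
    print(abduce({"wet street"}))     # rain
    print(analogize("comets"))        # comets may also orbit the sun ...
```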

Instruction Fine-Tuning

Instruction Fine-Tuning (IFT) adapts LLMs through supervised learning on task-specific instructions. For example, Liu et al. [2023c] design multi-grained instructions spanning diverse levels of abstraction and complexity. Similarly, Feng et al. [2024] fine-tune models to mimic logical solvers by replicating formal deductive reasoning processes. In addition, Xu et al. [2024a] implement two-stage symbolic fine-tuning through Injection (injecting symbolic knowledge) and Infusion (balancing symbolic and natural-language reasoning). To overcome IFT’s over-fitting limitations, Wang et al. [2024b] combine IFT with contrastive learning between factual and counterfactual reasoning paths. Further, Wang et al. [2024a] augment Llama models with a Program-Guided Learning Framework and logic-specific architecture adjustments. Recently, Muennighoff et al. [2025] propose s1, which achieves test-time scaling through IFT on 1,000 meticulously crafted long-CoT samples. Combined with a budget-forcing technique, it significantly enhances the reasoning capability of a Qwen2.5-32B-Instruct model, allowing it to extrapolate beyond its performance without test-time intervention.
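For orientation, the sketch below shows the supervised objective such IFT methods build on: next-token cross-entropy on instruction-response pairs, with the loss restricted to response tokens. The model name, training pair, and hyperparameters are illustrative placeholders, not the settings of the cited works.

```python
# Minimal supervised instruction fine-tuning (IFT) loop; model, data, and
# hyperparameters are placeholder assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the papers above fine-tune much larger models
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (instruction, target reasoning chain + answer) pairs -- toy examples only
data = [
    ("Premises: All apples are red. This fruit is an apple. Conclusion?",
     "Since every apple is red and this fruit is an apple, it is red."),
]

model.train()
for instruction, target in data:
    prompt_ids = tok(instruction + "\n", return_tensors="pt").input_ids
    full_ids = tok(instruction + "\n" + target + tok.eos_token,
                   return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Train only on the response tokens (assumes the prompt tokenization is a
    # prefix of the full tokenization, which holds for this toy example).
    labels[:, : prompt_ids.shape[1]] = -100
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```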

Reinforcement Learning

Reinforcement learning (RL) has become pivotal in optimizing large language models (LLMs), particularly since the breakthrough of Reinforcement Learning from Human Feedback (RLHF). Jiao et al. [2024] leverage RL for planning-based reasoning optimization, while Xi et al. [2024] develop R3, achieving the benefits of process supervision through outcome-only supervision.

The success of large-scale RL in OpenAI-o1 [OpenAI, 2024] has inspired numerous studies. RL algorithms train o1-style models to enhance Chain-of-Thought (CoT) reasoning, addressing issues like formulaic outputs and limited long-form reasoning. For instance, Zhao et al. [2024] integrate CoT instruction fine-tuning with Monte Carlo Tree Search (MCTS) decoding for multi-path reasoning exploration. In contrast, Zhang et al. [2024] employ MCTS to generate code-reasoning data for instruction fine-tuning (IFT) and Direct Preference Optimization (DPO).
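As a simplified stand-in for the MCTS-based multi-path exploration above, the sketch below samples several candidate reasoning traces and keeps the highest-scoring one (best-of-N rather than a full tree search); the model, prompt, and reward function are assumptions for illustration only.

```python
# Simplified multi-path reasoning exploration: sample several CoT candidates
# and keep the best-scoring one. A best-of-N stand-in, not the MCTS used by
# the cited works; model, prompt, and scorer are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def score(text: str) -> float:
    # Placeholder outcome reward: +1 if the expected answer string appears.
    return 1.0 if "red" in text else 0.0

prompt = "All apples are red. This fruit is an apple. Therefore,"
inputs = tok(prompt, return_tensors="pt")
candidates = model.generate(**inputs, do_sample=True, num_return_sequences=4,
                            max_new_tokens=32, top_p=0.95,
                            pad_token_id=tok.eos_token_id)
texts = [tok.decode(c, skip_special_tokens=True) for c in candidates]
print(max(texts, key=score))   # best candidate under the outcome reward
```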

A significant breakthrough comes from DeepSeek-R1 [DeepSeek-AI, 2025], which pioneers a novel RL strategy to enhance logical reasoning. DeepSeek-R1-Zero, trained purely through RL without IFT, demonstrates impressive reasoning capabilities but faces challenges in readability and language consistency. To address this, DeepSeek-R1 introduces minimal long-CoT IFT data as a cold start before RL, achieving a balance between usability and reasoning performance. By iteratively synthesizing high-quality reasoning data through RL, DeepSeek-R1 overcomes limitations imposed by human annotators, addressing issues such as mechanistic responses, repetitive patterns, and insufficient long-chain reasoning. This approach represents a potential paradigm shift in logical reasoning optimization, pushing the boundaries of what LLMs can achieve in structured reasoning tasks.
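The sketch below gives a heavily simplified, REINFORCE-style version of outcome-rewarded RL with a group-mean baseline, loosely in the spirit of the large-scale RL described above; it is not DeepSeek-R1's actual recipe, and the model, prompt, and reward are placeholder assumptions.

```python
# Toy outcome-only RL update: sample a group of traces, reward final answers,
# and push up the log-probability of above-average traces (REINFORCE with a
# group-mean baseline). Padding/EOS handling is omitted for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompt = "All apples are red. This fruit is an apple. Therefore,"
answer_key = "red"                              # outcome reward checks the answer only

prompt_ids = tok(prompt, return_tensors="pt").input_ids
samples = model.generate(prompt_ids, do_sample=True, num_return_sequences=4,
                         max_new_tokens=32, pad_token_id=tok.eos_token_id)
rewards = torch.tensor([1.0 if answer_key in tok.decode(s[prompt_ids.shape[1]:])
                        else 0.0 for s in samples])
advantages = rewards - rewards.mean()           # group baseline

loss = 0.0
for seq, adv in zip(samples, advantages):
    logits = model(seq.unsqueeze(0)).logits[:, :-1]          # predict tokens 1..L-1
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, seq.unsqueeze(0)[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_logp = token_logp[:, prompt_ids.shape[1] - 1:].sum()  # generated tokens only
    loss = loss - adv * gen_logp                # raise log-prob of high-reward traces

loss.backward()
optimizer.step()
optimizer.zero_grad()
```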

Inference-Time Decoding

We categorize logical reasoning enhancement methods at inference time into inference-time scaling and constrained decoding. Inference-time scaling employs computational augmentation without parameter updates. One common approach is decoding with structured outputs and modular workflows. GoT [Lei et al., 2023] creates structured reasoning nodes to improve complex multi-step logical reasoning. Similarly, Chain of Logic [Servantez et al., 2024] introduces a decomposition-recomposition structure for legal reasoning. In other contexts, researchers design more complex modular workflows for better performance [Creswell et al., 2023; Malon et al., 2024]. Another inference-time scaling approach stimulates autonomous reasoning, guiding LLMs to iteratively refine their answers. Maieutic Prompting [Jung et al., 2022] eliminates contradictions through recursive reasoning. Similarly, Logic-of-Thoughts [Liu et al., 2024a] and DetermLR [Sun et al., 2024] progressively approach the answer in an iterative style. Constrained decoding methods, on the other hand, focus on improving the controllability and reliability of reasoning processes. NeuroLogic [Lu et al., 2021] enforces predicate logic constraints, while Formal-LLM [Li et al., 2024b] integrates automata to constrain plan generation.

5.4 Neuro-Symbolic Approaches

Neural-symbolic hybrid methods represent a burgeoning research area that aims to combine the powerful representational capabilities of deep learning with the precision and interpretability of symbolic reasoning. Formally, a neural-symbolic hybrid system aims to optimize both the neural model M and the symbolic solver P (where P represents the symbolic reasoning process) to maximize logical reasoning performance. The overall objective can be expressed as:

(M*, P*) = arg max_{M, P} R(P(M(x))),

where:

• M: the neural model, which includes both the model’s parameters and its decoding strategies. It maps the input x (e.g., natural language) into a symbolic representation z within a formal language L: z = M(x), z ∈ L.
• P: the symbolic solver, which operates on the symbolic representation z produced by M to generate the final output y: y = P(z).
• R: the reasoning performance metric, which evaluates the ability to perform logical reasoning tasks.

The optimization process involves two key directions:

• Improving M: refining the model’s parameters and decoding strategies to produce symbolic representations that are both accurate and compatible with P.
• Enhancing P: improving the symbolic solver’s ability to process the symbolic representations produced by M.

By jointly optimizing M and P, neural-symbolic hybrid systems aim to leverage the strengths of both neural networks and symbolic reasoning to achieve superior logical reasoning capabilities. It is worth noting that in earlier neural-symbolic pipelines, P is often implemented as a fixed external logical reasoning engine and thus is generally not optimized. However, in advanced practice, LLMs are increasingly being used to perform the role of P, enabling diverse optimization strategies.
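To ground the objective above, the following minimal sketch instantiates the M → P → R decomposition, with a trivial pattern-based parser standing in for the neural model M and a tiny forward-chaining solver as P; all predicates and rules are illustrative assumptions rather than components of any cited system.

```python
# Toy neuro-symbolic pipeline matching the M / P / R decomposition above.
import re

def M(x: str):
    """Map natural language into symbolic facts/rules z in a small formal language L."""
    z = []
    for sent in x.split("."):
        sent = sent.strip().lower()
        if m := re.match(r"all (\w+)s are (\w+)", sent):
            z.append(("rule", m.group(1), m.group(2)))   # forall a: X(a) -> Y(a)
        elif m := re.match(r"this (\w+) is an? (\w+)", sent):
            z.append(("fact", m.group(2), m.group(1)))   # atom: X(this thing)
    return z

def P(z):
    """Symbolic solver: forward-chain the rules over the facts to derive the output y."""
    facts = {(pred, arg) for tag, pred, arg in z if tag == "fact"}
    rules = [(a, b) for tag, a, b in z if tag == "rule"]
    changed = True
    while changed:                                       # saturate under the rules
        changed = False
        for a, b in rules:
            for pred, arg in list(facts):
                if pred == a and (b, arg) not in facts:
                    facts.add((b, arg))
                    changed = True
    return facts

def R(y, gold):
    """Reasoning performance metric: 1 if the gold conclusion was derived."""
    return 1.0 if gold in y else 0.0

x = "All apples are red. This fruit is an apple."
print(R(P(M(x)), gold=("red", "fruit")))                 # -> 1.0
```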

Robustness vs. Generalization. LLMs exhibit inconsistent performance in structured reasoning tasks such as deductive inference and abductive hypothesis generation. While models fine-tuned on datasets like FOLIO [Han et al., 2024a] excel in controlled settings, they struggle with adversarial perturbations or semantically equivalent rephrasings. This inconsistency arises from their reliance on surface-level statistical correlations rather than causal relationships, coupled with limited out-of-distribution generalization. A key question persists: can LLMs achieve human-like robustness without sacrificing cross-domain adaptability? Current methods prioritize narrow task performance, leaving real-world applicability uncertain.

Interpretability vs. Performance. A central tension lies in balancing neural scalability with symbolic precision. Neuro-symbolic approaches like Logic-LM [Pan et al., 2023] and Symbol-LLM [Xu et al., 2024a] embed formal logic solvers into neural architectures, improving interpretability through step-by-step proofs. However, these methods face scalability bottlenecks with large knowledge bases or complex rule dependencies. Conversely, data-driven methods (e.g., instruction tuning on LogicBench [Parmar et al., 2024]) achieve broader task coverage but fail to generalize beyond syntactic patterns. How can we reconcile transparent reasoning with black-box model performance? Hybrid architectures offer promise but introduce computational overhead, limiting practical deployment.