Can Large Language Models Reason and Optimize Under Constraints?
Large Language Models (LLMs) have achieved notable performance across a wide range of natural language understanding and generation tasks, from open-ended dialogue and code synthesis to mathematical reasoning and scientific question answering. Yet a critical question remains largely unanswered: can LLMs reason and optimize under constraints? Real-world decision-making problems, spanning power grid management, financial operations, and cyber-security, require not only language competence but also the ability to jointly interpret structured inputs, perform multi-step arithmetic, satisfy interacting physical or logical constraints, and converge to feasible, near-optimal solutions. These challenges go far beyond what current benchmarks assess.
Existing benchmarks fall short in evaluating these capabilities in a rigorous and realistic manner. General reasoning benchmarks such as MMLU and GPQA assess broad knowledge and expert-level question answering, but do not require iterative numerical optimization or constraint satisfaction over structured physical systems. Logical reasoning benchmarks such as ARC-AGI, SATBench, and ZebraLogic probe constraint satisfaction in formal or combinatorial settings, but rely on synthetic puzzles disconnected from real-world engineering complexity. Prior work on constraint satisfaction falls into three main categories: work on problem formulation from natural language, work on direct constraint satisfaction and combinatorial reasoning, and neuro-symbolic methods that translate problems into symbolic representations and rely on external solvers. In contrast, our work focuses on a demanding task requiring end-to-end model competence: jointly interpreting context, satisfying the constraints, and producing accurate responses on a hard numerical problem.
Across virtually all tasks requiring genuine optimization under constraints, models remain at a constraint satisfaction rate of approximately 55–60%, regardless of architecture, scale, or training regime. Reasoning models, despite their extended chain-of-thought generation, do not systematically outperform their non-reasoning counterparts. Supervised fine-tuning improves response formatting but fails to improve physical feasibility, confirming that the models take shortcuts in reasoning. Reinforcement learning with constraint-satisfaction rewards, however, yields modest but meaningful improvements on some grid topologies.
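The text does not spell out how the constraint satisfaction rate is computed. One plausible reading, assumed here, is the fraction of model outputs that satisfy every constraint simultaneously. The sketch below illustrates that reading on a hypothetical two-generator dispatch problem; the names `is_feasible` and `constraint_satisfaction_rate`, the limits, the demand, and the tolerance are all illustrative assumptions, not values from the benchmark.

```python
from typing import Callable, Dict, List

Solution = Dict[str, float]
Constraint = Callable[[Solution], bool]

# Hypothetical two-generator dispatch problem (all numbers are illustrative).
DEMAND_MW = 120.0
TOLERANCE_MW = 1e-3

CONSTRAINTS: List[Constraint] = [
    lambda s: 0.0 <= s["p1"] <= 100.0,                             # generator 1 limits
    lambda s: 0.0 <= s["p2"] <= 50.0,                              # generator 2 limits
    lambda s: abs(s["p1"] + s["p2"] - DEMAND_MW) <= TOLERANCE_MW,  # power balance
]


def is_feasible(solution: Solution, constraints: List[Constraint]) -> bool:
    """A solution counts as feasible only if every constraint holds."""
    return all(check(solution) for check in constraints)


def constraint_satisfaction_rate(
    solutions: List[Solution], constraints: List[Constraint]
) -> float:
    """Fraction of candidate solutions that satisfy all constraints."""
    if not solutions:
        raise ValueError("need at least one solution")
    feasible = sum(is_feasible(s, constraints) for s in solutions)
    return feasible / len(solutions)


# Made-up model outputs, used only to exercise the functions.
model_outputs = [
    {"p1": 80.0, "p2": 40.0},   # feasible
    {"p1": 110.0, "p2": 10.0},  # violates the generator 1 upper limit
    {"p1": 70.0, "p2": 40.0},   # violates power balance (110 MW instead of 120 MW)
]
rate = constraint_satisfaction_rate(model_outputs, CONSTRAINTS)
print(f"constraint satisfaction rate: {rate:.2%}")
```

Under this all-or-nothing definition, a response that is well formatted and numerically close but breaks a single limit still counts as a failure. This is consistent with the observation that better formatting alone does not raise feasibility.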
Since LLMs cannot actually perform Newton-Raphson iterations in their "latent space," they often resort to result guessing. If a model recognizes the input as similar to a standard power grid, or to a familiar financial or cyber-security dataset, it will provide values that look like a valid solution. Our results on the N-1 out-of-distribution test set show increased error relative to the N (in-distribution) case, which suggests that even under fine-tuning and GRPO, LLMs mainly rely on memorization. Evaluating an LLM on OPF without tools therefore tests its calculation limits, not only its logical reasoning. One could argue that a better paradigm is to restrict LLMs to abstraction tasks: reading the grid data, understanding the physics, deducing the correct constraints, and writing the mathematical formulation as solver code, then letting the solver execute the heavy numerical matrix arithmetic and return the result.
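To make concrete the kind of iterative arithmetic involved, here is a minimal sketch of Newton-Raphson on the simplest possible case: a lossless two-bus line where the transferred power is P = (V1 V2 / X) sin(θ) and the unknown is the angle θ. The function name `solve_two_bus_angle` and all per-unit values are hypothetical. A realistic OPF instance has many coupled unknowns and inequality constraints, and this sketch makes no attempt to cover them.

```python
import math
from typing import Tuple


def solve_two_bus_angle(
    p_transfer: float,
    v1: float = 1.0,
    v2: float = 1.0,
    x: float = 0.1,
    tol: float = 1e-10,
    max_iter: int = 20,
) -> Tuple[float, int]:
    """Solve (v1 * v2 / x) * sin(theta) = p_transfer for theta by Newton-Raphson.

    Returns the angle in radians and the number of Newton updates performed.
    All quantities are in per-unit; the default values are illustrative.
    """
    b = v1 * v2 / x
    theta = 0.0  # flat start
    for updates in range(max_iter + 1):
        mismatch = b * math.sin(theta) - p_transfer
        if abs(mismatch) < tol:
            return theta, updates
        jacobian = b * math.cos(theta)  # d(mismatch)/d(theta)
        if abs(jacobian) < 1e-12:
            raise ZeroDivisionError("singular Jacobian; no Newton step possible")
        theta -= mismatch / jacobian
    raise RuntimeError("Newton-Raphson did not converge")


theta, updates = solve_two_bus_angle(p_transfer=5.0)
print(f"theta = {theta:.6f} rad after {updates} updates")
```

Even in this scalar case, each update requires evaluating trigonometric functions and a division to many digits. That is the kind of computation a solver handles reliably and a language model producing tokens does not.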
Reinforcement learning (RL) with outcome-based rewards has demonstrated great success in improving the reasoning capabilities of large language models (LLMs) (Wang et al., 2024; Srivastava & Aggarwal, 2025; Xu et al., 2025; Guo et al., 2025). Recently, it has received increasing attention for building LLM-based agents, where the agent needs to interact with the environment and resolve tasks beyond single-turn interactions (Zhang et al., 2025a; Plaat et al., 2025). Such tasks are often not clearly specified (e.g., underspecified user queries), so the agent needs to ask questions strategically to acquire the missing information, i.e., perform multi-turn active reasoning (Zhou et al., 2025; Wu et al., 2025; Laban et al., 2025; Li et al., 2025).
Despite this success, we find that training an LLM agent with outcome-based RL suffers from information self-locking (SeL). Under SeL, agents often get stuck in low-information interaction patterns, in which the agent ceases to ask informative questions and struggles to internalize already-obtained information. This aligns with existing failure modes of agents in real-world use (Wang et al., 2025b). To better understand these failure modes, we decompose agentic behavior in active reasoning into action selection (AS), which determines what information is queried, and belief tracking (BT), which governs how acquired evidence is internalized and affects the final outcome. Across two multi-turn active reasoning benchmarks in Sec. 2, we show that neither capability improves effectively even when the task reward increases. This raises a challenging research question:
Why does SeL happen and how to mitigate it?
Belief tracking is essential to the success of active reasoning. Essentially, the agent needs to model its belief $b^M_t \in \Delta(S)$ about the progress of the problem solving and what information remains missing throughout turn $t \in \{0, \dots, H\}$. The behaviors of an agent with parameters $\omega$ in active reasoning can be decomposed into two coupled processes: Action Selection (AS): The agent selects an action (e.g., a question) according to a belief-conditioned policy $a_t \sim \pi^Q_\omega(\cdot \mid b^M_t)$ aiming to elicit informative observations $o_t \sim O(\cdot \mid s^\star, a_t)$ from the environment; Belief Tracking (BT): After receiving an observation $o_t \in O$, the agent updates its belief via an internal update operator $b^M_{t+1} = \pi^U_\omega(b^M_t, a_t, o_t)$, integrating information accumulated over previous interaction rounds.
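The AS/BT decomposition can be illustrated with a toy loop over a small discrete state space. This is a sketch under strong assumptions, not the agent studied here: the LLM policies are replaced by a greedy expected-information-gain rule for AS and an exact Bayesian update with noise-free yes/no answers for BT. The states, the questions, and every function name are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Belief = Dict[str, float]          # probability per hidden state
Question = Callable[[str], bool]   # noise-free yes/no answer given the true state


def entropy(belief: Belief) -> float:
    """Shannon entropy of the belief, in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def update_belief(belief: Belief, question: Question, answer: bool) -> Belief:
    """BT stand-in: zero out states inconsistent with the answer, renormalize."""
    posterior = {s: (p if question(s) == answer else 0.0) for s, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation is inconsistent with the current belief")
    return {s: p / total for s, p in posterior.items()}


def expected_information_gain(belief: Belief, question: Question) -> float:
    """Expected entropy reduction from asking the question."""
    gain = entropy(belief)
    for answer in (True, False):
        p_answer = sum(p for s, p in belief.items() if question(s) == answer)
        if p_answer > 0:
            gain -= p_answer * entropy(update_belief(belief, question, answer))
    return gain


def select_action(belief: Belief, questions: Dict[str, Question]) -> str:
    """AS stand-in: greedily pick the question with the highest expected gain."""
    return max(questions, key=lambda q: expected_information_gain(belief, questions[q]))


def run_episode(
    true_state: str, states: List[str], questions: Dict[str, Question], horizon: int
) -> Tuple[Belief, List[str]]:
    belief = {s: 1.0 / len(states) for s in states}  # uniform initial belief
    asked: List[str] = []
    for _ in range(horizon):
        if max(belief.values()) >= 1.0 - 1e-12:  # nothing left to learn
            break
        name = select_action(belief, questions)
        observation = questions[name](true_state)  # environment response
        belief = update_belief(belief, questions[name], observation)
        asked.append(name)
    return belief, asked


states = ["A", "B", "C", "D"]
questions: Dict[str, Question] = {
    "is_A_or_B": lambda s: s in {"A", "B"},
    "is_A_or_C": lambda s: s in {"A", "C"},
    "is_A": lambda s: s == "A",
}
final_belief, asked = run_episode("C", states, questions, horizon=5)
print(asked, final_belief)
```

The two functions depend on each other in the way the decomposition describes: `select_action` scores questions using the current belief, and `update_belief` can only sharpen the belief as far as the chosen question allows.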
2.3. Failure modes in reinforcement learning training
Despite the success of RL with outcome-based rewards, we find that LLM agents exhibit several failure modes across both active-reasoning testbeds during training.
Observation 1: Reward improvements do not translate into increased information acquisition. Fig. 2a (PE-G) and Fig. 2b (MediQ) report the training dynamics of episode reward, per-turn AS, and per-turn BT. Across both datasets, we observe a pronounced decoupling: while the reward improves over training, BT exhibits only limited gains, and AS fails to improve, often plateauing or even degrading. This observation raises questions about how AS and BT confound each other, since AS cannot improve even when BT improves.
To isolate the effect, we analyze the relationship between AS and reward under different BT capabilities. Specifically, we fix identical action sequences and compare outcomes when the observation stream is processed by (i) the agent’s internal BT versus (ii) stronger belief-update mechanisms, e.g., human-defined update rules or frontier reasoning models. Since the environment dynamics and action sequences are identical across conditions, the difference can be attributed solely to the belief update mechanism.
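The controlled comparison can be sketched as replaying one fixed trace of (action, observation) pairs through two different belief-update operators. Everything below is a hypothetical stand-in: `bayes_update` plays the role of a strong, rule-based BT, and `make_sticky_update` models a weak BT that mostly keeps its previous belief. The question semantics, the prior, and the stickiness value are made up for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Belief = Dict[str, float]
Step = Tuple[str, bool]  # (question name, observed yes/no answer)
Updater = Callable[[Belief, str, bool], Belief]

# Hypothetical question semantics: states for which the answer is "yes".
YES_STATES: Dict[str, Set[str]] = {"q1": {"A", "B"}, "q2": {"A", "C"}}


def bayes_update(belief: Belief, question: str, answer: bool) -> Belief:
    """Strong BT stand-in: exact Bayesian update for noise-free answers."""
    posterior = {
        s: (p if (s in YES_STATES[question]) == answer else 0.0)
        for s, p in belief.items()
    }
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation is inconsistent with the current belief")
    return {s: p / total for s, p in posterior.items()}


def make_sticky_update(stickiness: float) -> Updater:
    """Weak BT stand-in: move only a fraction of the way toward the Bayes posterior."""
    def sticky_update(belief: Belief, question: str, answer: bool) -> Belief:
        target = bayes_update(belief, question, answer)
        return {
            s: stickiness * belief[s] + (1.0 - stickiness) * target[s] for s in belief
        }
    return sticky_update


def replay(trace: List[Step], updater: Updater, prior: Belief) -> Belief:
    """Process the same action/observation stream with a given belief updater."""
    belief = dict(prior)
    for question, answer in trace:
        belief = updater(belief, question, answer)
    return belief


# The agent initially favors "A"; the true state is "C".
prior = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
trace = [("q1", False), ("q2", True)]  # identical for both conditions

strong = replay(trace, bayes_update, prior)
weak = replay(trace, make_sticky_update(0.9), prior)
print("strong BT predicts:", max(strong, key=strong.get))
print("weak BT predicts:  ", max(weak, key=weak.get))
```

Because the trace is shared, any difference in the final prediction comes from the updater alone. In this toy case the same two informative answers identify the true state under the strong updater but leave the weak updater at its initial guess.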
Observation 2: Weak belief tracking masks the contribution of informative actions. As shown in Fig. 2c and 2d, the correlation between AS and reward is substantially higher under strong BT, but remains weak when using the agent’s own BT. This indicates that the contribution of AS to the reward is masked when belief updates are unreliable: even high-information actions yield little reward improvement if their information is not incorporated into the internal belief. As a result, policy optimization cannot yield stable learning signals to reinforce informative AS choices.
Observation 3: Conservative action selection limits belief refinement and induces interaction-insensitive shortcuts.
Complementary to Obs. 2, we now examine the reverse direction of the coupling. When AS becomes conservative and yields little informative evidence, BT is deprived of meaningful signals to learn from. Under outcome-only supervision, this even incentivizes shortcut behaviors that reduce reliance on interaction, reinforcing a low-information training regime. We observe that as training progresses, agents become less sensitive to informative observations and increasingly rely on early-stage context. In MediQ, we intervene by replacing all patient feedback with Unknown while keeping all other configurations unchanged. Notably, the induced performance drop becomes smaller after RL training (41.25→30.50 w/o RL versus 61.00→55.50 with RL; see Fig. 5a), suggesting that interaction-derived evidence has a weaker causal effect on the final decision. Crucially, this reduced sensitivity is accompanied by an increase in belief consistency (Fig. 5a; 78.7 w/o RL versus 92.8 with RL): the agent increasingly adheres to its initial judgment instead of revising its beliefs in response to interaction, which reflects a more “stubborn” belief update pattern. Together, these effects constitute interaction under-utilization: once conservative AS restricts information exposure and weak BT struggles to internalize evidence, RL pressure favors non-interactive heuristics that stabilize outcomes while further suppressing exploration and evidence usage.
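The size of the intervention effect follows directly from the four scores reported above. The helper `performance_drop` is a hypothetical name; the relative drop is an extra normalization added here for comparison and is not a quantity reported in the text.

```python
from typing import Tuple


def performance_drop(with_feedback: float, without_feedback: float) -> Tuple[float, float]:
    """Return (absolute drop in points, drop relative to the with-feedback score)."""
    absolute = with_feedback - without_feedback
    return absolute, absolute / with_feedback


# Scores quoted in the text for the MediQ "Unknown" feedback intervention.
drop_without_rl = performance_drop(41.25, 30.50)
drop_with_rl = performance_drop(61.00, 55.50)

for label, (absolute, relative) in (
    ("w/o RL", drop_without_rl),
    ("with RL", drop_with_rl),
):
    print(f"{label}: {absolute:.2f} points ({relative:.1%} of the original score)")
# w/o RL: 10.75 points (26.1% of the original score)
# with RL: 5.50 points (9.0% of the original score)
```

Removing the feedback costs roughly half as many points after RL training. Relative to the higher starting score the drop is smaller still, which is the sense in which the trained agent depends less on what it learns through interaction.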
Information Self-Locking of RL training for active reasoning. Taken together, these observations indicate that SeL emerges from a bidirectional coupling between AS and BT. The reward-relevant value of AS is mediated by the agent’s ability to absorb information through BT, while BT is in turn constrained by the information budget induced by AS. This mutual dependence can trap training dynamics in a low-information regime, giving rise to self-locking behavior.