Do larger language models overcome greediness in sequential decision-making?

This explores whether scaling up model size fixes the tendency of LLMs to act 'greedily' in sequential decisions — grabbing the locally best option instead of exploring — and the corpus speaks to this obliquely, mostly by showing that scale rarely cures a *structural* bias.

I read the question this way: when an LLM has to make a sequence of choices, it tends to behave greedily, taking the immediately rewarding move rather than the one that pays off later, and you're asking whether bigger models simply grow out of it. The collection doesn't have a paper that runs the exact bandit or exploration experiment, so I'll be upfront about that. But several notes converge on a more interesting answer: the failures that look like greediness tend to be *structural*, and structural failures don't reliably dissolve with scale.
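To make 'greedy' concrete, here is a minimal sketch of the kind of bandit setup the question has in mind. It is a toy illustration, not an experiment from any of the notes: the function name `run_bandit`, the two-arm payout probabilities, and the choice of epsilon-greedy as the exploring baseline are all assumptions made for the example.

```python
import random


def run_bandit(arm_probs, steps, epsilon, seed=0):
    """Play a Bernoulli bandit and return (average reward, pulls per arm).

    epsilon=0.0 is a purely greedy policy; epsilon>0 explores a random arm
    with that probability. All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)    # pulls per arm
    totals = [0.0] * len(arm_probs)  # summed reward per arm
    reward_sum = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the best empirical mean so far.
            # Untried arms are scored 0.0, and ties go to the lowest index.
            means = [
                totals[i] / counts[i] if counts[i] else 0.0
                for i in range(len(arm_probs))
            ]
            arm = max(range(len(arm_probs)), key=lambda i: means[i])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        reward_sum += reward
    return reward_sum / steps, counts


if __name__ == "__main__":
    # Arm 0 never pays, arm 1 always pays (a deliberately extreme toy case).
    print(run_bandit([0.0, 1.0], steps=100, epsilon=0.0))
    print(run_bandit([0.0, 1.0], steps=1000, epsilon=0.1))
```

In this extreme case the greedy run never leaves arm 0: its estimate for the untried arm stays tied at zero, so it collects no reward at all, while even a small exploration rate finds the paying arm. The point of the sketch is that the failure comes from the decision rule, not from how well the agent estimates rewards.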

The sharpest evidence is the finding that what looks like good reasoning is often just a default. Twelve of fourteen models actually performed *worse* when constraints were removed, dropping by up to 38.5 percentage points: they were defaulting to the harder, safer option rather than evaluating the situation, and that 'conservative bias' was hiding behind apparent reasoning success (see "Are models actually reasoning about constraints or just defaulting conservatively?"). That's the cousin of greediness: a fixed policy masquerading as deliberation. Scale didn't wash it out. In the same spirit, framing LLMs as autoregressive probability machines predicted *which* tasks they'd fail on: low-probability targets, such as reciting the alphabet backwards or counting letters, stay hard even when they're logically trivial, and the difficulty tracks the architecture, not the parameter count (see "Can we predict where language models will fail?").

The strategic-reasoning work cuts against a simple 'bigger is less greedy' story from another angle. When 22 models were dropped into behavioral game theory, performance correlated with *game structure*, not raw reasoning depth: different frontier models settled into distinct fixed styles (minimax, trust-based, belief-anticipation) (see "Do large language models use one reasoning style or many?"). So a model isn't 'greedy' or 'not greedy' in the abstract; its myopia is conditional on the decision's shape. That's a hint that you fix greediness by changing the decision procedure, not by adding parameters.

And that's where the corpus points toward what actually helps: making the model decide *how* to decide. The 'learn when to think versus answer fast' work trains a single model to route between extended deliberation and a quick response, instead of always reaching for one mode (see "Can models learn when to think versus respond quickly?"). Calibration is the other lever: small models trained to know when they're uncertain and abstain matched models ten times their size, which says the missing ingredient is a learned sense of when the obvious move is wrong, not sheer capacity (see "Can models learn to abstain when uncertain about predictions?"). Even reward design matters here: using the model's own confidence as a training signal restores calibration *while* improving step-by-step reasoning (see "Can model confidence work as a reward signal for reasoning?").
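The shape of 'deciding how to decide' can be sketched as a confidence gate. This is only an illustration under loud assumptions: the cited work learns its routing and abstention behavior through training, whereas the sketch below swaps in hand-set thresholds, and the names `route`, `Decision`, `answer_threshold`, and `abstain_threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "answer", "deliberate", or "abstain"
    reason: str


def route(
    confidence: float,
    answer_threshold: float = 0.8,   # hypothetical cutoff, not from the notes
    abstain_threshold: float = 0.3,  # hypothetical cutoff, not from the notes
) -> Decision:
    """Pick a decision mode from a calibrated confidence score in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= answer_threshold:
        # The obvious answer is probably right: respond quickly.
        return Decision("answer", "confidence at or above answer threshold")
    if confidence < abstain_threshold:
        # Too uncertain for extra thinking to be trusted: decline.
        return Decision("abstain", "confidence below abstain threshold")
    # Middle band: the obvious move may be wrong, so spend more compute.
    return Decision("deliberate", "confidence between the two thresholds")


if __name__ == "__main__":
    for score in (0.95, 0.5, 0.1):
        print(score, route(score).action)
```

A gate like this is only as good as the confidence score feeding it, which is why the calibration results matter: a model that is confidently wrong would be routed to the fast path every time.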

The quiet payoff: the collection reframes 'greediness' as a calibration-and-procedure problem rather than a size problem. If you came expecting 'yes, GPT-N+1 explores better,' the more useful takeaway is that the cure shows up when a model learns *when to deliberate* and *when to doubt the obvious answer*, and those capacities already exist, undertrained, even in small models.

Sources (6 notes)

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
