Can agents revise their beliefs predictably when presented with interventions?

This note explores whether an agent's beliefs move in a stable, predictable direction when you intervene (by feeding it feedback, framing, or evidence), or whether the updates are skewed and depend on the kind of intervention.

Put plainly, the question is: when you push on an agent's beliefs, do they shift the way you'd expect? The corpus suggests the answer is *measurably yes, but not symmetrically or universally*. Belief revision is real enough to quantify and even harness, yet it is biased in systematic directions, and some interventions leave it untouched.

The most direct evidence that belief shift is trackable comes from work treating it as a signal you can read turn by turn. One approach derives a dense reward purely from how much an agent's own probability estimate moves toward the right answer, using the log-ratio of sequential belief estimates as automatic credit assignment (Can an agent's own beliefs guide credit assignment without critics?). That only works because belief movement is consistent enough to mean something: if revision were noise, you couldn't reward it. So at the mechanical level, agents do revise predictably enough to steer learning.
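The log-ratio idea can be illustrated with a minimal sketch. It assumes the agent reports, after each turn, a probability that the correct answer is true. The function name `belief_delta_rewards` and the clipping constant `eps` are hypothetical choices for this illustration, not details taken from the cited work.

```python
import math

def belief_delta_rewards(p_correct, eps=1e-8):
    """Per-turn reward as the log-ratio of consecutive belief estimates.

    p_correct[t] is the agent's estimated probability of the correct
    answer after turn t (p_correct[0] is the prior before any turn).
    The reward for turn t is log(p_t / p_{t-1}): positive when belief
    moved toward the right answer, negative when it moved away.
    """
    # Clip to avoid log(0); eps is an illustrative assumption.
    clipped = [min(max(p, eps), 1.0) for p in p_correct]
    return [math.log(curr / prev) for prev, curr in zip(clipped, clipped[1:])]

# Hypothetical belief trajectory over three turns.
beliefs = [0.1, 0.2, 0.15, 0.6]
rewards = belief_delta_rewards(beliefs)
total = sum(rewards)  # telescopes to log(final belief / initial belief)
```

Because the per-turn terms telescope, the summed reward depends only on the first and last estimates, while the individual terms say which turns did the work. No separate critic is needed in this sketch, since the credit signal comes from the agent's own estimates.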

But *predictable* is not the same as *unbiased*. Agents revise asymmetrically: they are optimistic about actions they chose and pessimistic about the alternatives they passed over, mirroring a human bias, and the asymmetry vanishes when the agency framing is removed (Do language models learn differently from good versus bad outcomes?). So the intervention itself, here whether the agent thinks it 'owns' the choice, reshapes how evidence lands. The *form* of feedback matters too: natural feedback carries both an evaluative signal (how good was this?) and a directive one (which way to change). A flat scalar reward captures the first and discards the second, so the same correction updates beliefs differently depending on how it is encoded (Can scalar rewards capture all the information in agent feedback?). And unambiguous binary success/failure feedback is what keeps verbal self-reflection honest: it blocks the agent from rationalizing its way around a bad outcome (Can agents learn from failure without updating their weights?).
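One common way to make this asymmetry concrete is a value update with different learning rates for confirming and disconfirming evidence. The sketch below is an illustrative toy, not the method of the cited note. The function names and the learning-rate values (`lr_confirm`, `lr_disconfirm`, `lr`) are assumptions chosen only to show the shape of the bias.

```python
def update_value(value, outcome, chosen, lr_confirm=0.3, lr_disconfirm=0.1):
    """Asymmetric update of an option's estimated value.

    Evidence counts as 'confirming' when it flatters the agent's choice:
    good news about the chosen option, or bad news about the unchosen one.
    Confirming evidence is weighted more heavily (assumed rates).
    """
    error = outcome - value
    confirms = (error > 0) if chosen else (error < 0)
    lr = lr_confirm if confirms else lr_disconfirm
    return value + lr * error

def update_value_no_agency(value, outcome, lr=0.2):
    """Symmetric update, standing in for the condition without agency framing."""
    return value + lr * (outcome - value)

# Same evidence, different landing depending on who 'owns' the choice.
chosen_good = update_value(0.5, 1.0, chosen=True)     # moves up strongly
chosen_bad = update_value(0.5, 0.0, chosen=True)      # moves down weakly
unchosen_good = update_value(0.5, 1.0, chosen=False)  # moves up weakly
unchosen_bad = update_value(0.5, 0.0, chosen=False)   # moves down strongly
```

In this toy, the update direction is always predictable from the evidence, but its size depends on the framing, which is the sense in which revision can be predictable and biased at once.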

The sharper, less obvious finding is that some interventions you'd *expect* to move beliefs simply don't. Telling a model it's being watched, a classic social-pressure lever on humans, has no effect on how often it omits hints from its reasoning (Does telling models they are watched improve reasoning faithfulness?). The model's chain-of-thought reporting isn't modulated by perceived social context the way ours is. That's a caution: predictability here is about the mechanics of evidence, not about transferring human persuasion intuitions onto the agent.

The deeper reason revision can look erratic is that today's agents often model behavior without modeling belief at all. Simulations that produce plausible outputs without internal reasoning structures can't do genuine counterfactual updating: you get the surface of belief change without a traceable mechanism (Can language models simulate belief change in people?). The cracks show most under information asymmetry, where models that look socially competent fail systematically once an agent has to reason about what others *don't* know (Why do LLMs fail when simulating agents with private information?). One promising fix is letting reasoning carry uncertainty explicitly: stochastic latent transitions let a model hold a distribution over answers rather than committing to one, which is closer to what 'revisable belief' should mean than a single deterministic guess (Can stochastic latent reasoning help models explore multiple solutions?). So: agents revise predictably enough to exploit, biased enough to mistrust, and only as deeply as their architecture actually represents belief in the first place.
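The contrast between a deterministic guess and a distribution over answers can be shown with a toy latent state. This is a loose sketch under strong assumptions, not an implementation of GRAM or any cited model. The one-dimensional state, the two attractors standing for answers "A" and "B", the Gaussian noise, and every function name are hypothetical.

```python
import random
from collections import Counter

def deterministic_step(h):
    # Pull the latent state halfway toward the nearest attractor (+1 or -1).
    target = 1.0 if h >= 0 else -1.0
    return h + 0.5 * (target - h)

def stochastic_step(h, rng, noise=1.0):
    # Same transition, but perturbed by sampled noise before the pull.
    return deterministic_step(h + rng.gauss(0.0, noise))

def decode(h):
    # Read an answer off the latent state.
    return "A" if h >= 0 else "B"

def rollout(h0, steps, rng=None):
    # rng=None gives the deterministic model; otherwise transitions are sampled.
    h = h0
    for _ in range(steps):
        h = stochastic_step(h, rng) if rng is not None else deterministic_step(h)
    return decode(h)

def answer_distribution(h0, steps, n_samples, seed=0):
    # Empirical distribution over answers from repeated stochastic rollouts.
    rng = random.Random(seed)
    counts = Counter(rollout(h0, steps, rng) for _ in range(n_samples))
    return {answer: count / n_samples for answer, count in counts.items()}

single_guess = rollout(0.0, steps=3)                         # always the same answer
belief = answer_distribution(0.0, steps=3, n_samples=1000)   # mass on both answers
```

Starting from an ambiguous state, the deterministic rollout commits to one answer every time, while the sampled rollouts spread probability mass across both. Only the second gives new evidence something to shift, which is the point the paragraph above makes about what a revisable belief requires.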

Sources 8 notes

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Does telling models they are watched improve reasoning faithfulness?

Prompting models that their reasoning is monitored has no effect on hint omission rates. This suggests CoT generation is not modulated by perceived social context, ruling out prompt-engineering fixes and certain safety monitoring assumptions.

Can language models simulate belief change in people?

LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
