INQUIRING LINE

Does common ground alignment require explicit rewards to emerge?

This explores whether agents can come to share a partner's frame of reference ("common ground") only when an explicit reward signal trains them to — or whether that alignment can emerge from the shape of training itself, without anyone scoring it.


This question reads as: does an agent need to be explicitly rewarded for being partner-aware before it will take a collaborator's interventions seriously, or can that behavior fall out of the training setup as a side effect? The corpus has a direct and surprising answer: no explicit reward is required. The clearest evidence is that standard alignment methods like RLHF and DPO actually produce *collaborators that ignore partner interventions*: they learn to optimize a reward, surface plausibility is enough to game that reward, and so they steamroll suggestions (see "Why do standard alignment methods ignore partner interventions?"). The fix there isn't a bigger or cleverer reward for listening; it's a structural constraint. Regularizing the agent to stay consistent when the intervention pathway is causally nullified forces it to judge a partner's suggestion by its real causal impact rather than by how it looks. Common ground alignment then "emerges as a byproduct"; no reward term names it.
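One way to picture the structural constraint is as a consistency penalty added to the task loss. The sketch below is only one possible reading of the idea, not the cited method's actual objective: the function names, the KL form of the penalty, and the weight `lam` are all assumptions made for illustration. It compares the agent's action distribution on a plain context with its distribution on the same context plus a suggestion whose causal pathway has been nullified, and penalizes any difference between the two.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def invariance_penalty(policy_plain: List[float],
                       policy_nullified: List[float]) -> float:
    """Hypothetical penalty: how far the agent's action distribution moves
    when it sees a suggestion that cannot causally affect the outcome."""
    return kl_divergence(policy_plain, policy_nullified)


def total_loss(task_loss: float,
               policy_plain: List[float],
               policy_nullified: List[float],
               lam: float = 1.0) -> float:
    """Task loss plus the weighted invariance penalty (lam is an assumed knob)."""
    return task_loss + lam * invariance_penalty(policy_plain, policy_nullified)


# An agent swayed by a plausible-looking but inert suggestion is penalized;
# an agent that ignores it is not.
swayed = total_loss(0.5, [0.5, 0.5], [0.9, 0.1])
unmoved = total_loss(0.5, [0.5, 0.5], [0.5, 0.5])
```

Under this reading, nothing in the loss rewards "listening" directly. The agent is only discouraged from reacting to suggestions that make no causal difference, which leaves real causal impact as the one reason to change course.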

That pattern, alignment arriving through invariance and structure rather than a scoring signal, repeats across the collection under different names. Consistency training teaches a model to respond the same way to clean and dressed-up prompts using *its own* clean answers as the target, so the desired robustness comes from self-consistency, not an external grader (see "Can models learn to ignore irrelevant prompt changes?"). Whole families of methods convert sparse outcome rewards into dense step-by-step guidance purely by exploiting the *structure* of an agent's trajectory (tree branching, tool-call positions, expert-aligned steps), with no separately trained process reward model (see "Can trajectory structure replace hand-annotated process rewards?" and "Can tree structure alone convert outcome rewards into process supervision?"). The recurring lesson: a lot of what we reach for explicit rewards to teach is already latent in the structure of the interaction.
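The self-consistency idea can be sketched as a data-construction step. This is a minimal illustration under stated assumptions, not the cited method's implementation: `model` is a hypothetical callable from prompt to response, `wrap` is a hypothetical function that dresses a prompt up with irrelevant content, and the toy versions below exist only to make the example runnable. What it shows is that the training target for the wrapped prompt is the model's own answer to the clean prompt, so no external grader appears anywhere.

```python
from typing import Callable, List, Tuple


def build_consistency_pairs(
    model: Callable[[str], str],
    wrap: Callable[[str], str],
    clean_prompts: List[str],
) -> List[Tuple[str, str]]:
    """Return (wrapped_prompt, target) pairs for fine-tuning.

    The target is the model's own response to the clean prompt, so the
    supervision signal is self-generated rather than externally graded.
    """
    pairs = []
    for prompt in clean_prompts:
        target = model(prompt)             # model's own clean answer
        pairs.append((wrap(prompt), target))
    return pairs


# Toy stand-ins (hypothetical), used only so the sketch runs end to end.
def toy_model(prompt: str) -> str:
    return f"answer to: {prompt}"


def toy_wrap(prompt: str) -> str:
    return f"I am sure you will agree with me. {prompt}"


pairs = build_consistency_pairs(toy_model, toy_wrap, ["Is 7 prime?"])
```

Each pair would then feed an ordinary supervised fine-tuning step. The design choice worth noticing is that the targets are regenerated from the current model rather than taken from a fixed dataset.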

There's an even more interior version of this. ΔBelief-RL lets an agent use the *shift in its own belief* toward a solution as a dense intrinsic reward: there is no critic network and no process reward model, and the signal is generated from the model's evolving probability estimates (see "Can an agent's own beliefs guide credit assignment without critics?"). So "reward" here isn't external at all; it's read off the agent's internal state. If common ground is about tracking and updating toward a partner, this is the same move pointed inward.
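The belief-shift signal can be made concrete with a few lines of arithmetic. The sketch below assumes, as the source note on ΔBelief-RL describes, that per-turn credit is the log-ratio of successive probability estimates; the function name and the example numbers are hypothetical. For beliefs p_0, p_1, ..., p_T in the correct solution, the reward at turn t is log(p_t / p_{t-1}), and the rewards telescope so that their sum equals log(p_T / p_0).

```python
import math
from typing import List


def delta_belief_rewards(beliefs: List[float]) -> List[float]:
    """Per-turn intrinsic rewards as log-ratios of successive beliefs.

    beliefs[t] is the agent's estimated probability of the correct
    solution after turn t (beliefs[0] is the prior before any turn).
    """
    if any(p <= 0.0 or p > 1.0 for p in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]


# Hypothetical belief trajectory: a helpful turn, a misleading one,
# then a decisive one.
beliefs = [0.05, 0.10, 0.08, 0.40]
rewards = delta_belief_rewards(beliefs)
total = sum(rewards)  # equals log(0.40 / 0.05) up to rounding
```

A turn that raises the belief earns positive credit and a turn that lowers it earns negative credit, which is how a single end-of-episode outcome gets spread across individual turns without a critic.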

But the corpus also marks the limit, and it's worth holding both. "Without explicit reward" is not the same as "without any external anchor." Pure self-improvement with no outside signal stalls, because the generation-verification gap, diversity collapse, and reward hacking all bite, and the methods that actually work quietly smuggle in an anchor: a past model version, a third-party judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). The partner-aware case fits this exactly: the anchor is the partner's intervention itself, and the causal-invariance constraint is what forces the agent to actually metabolize it. There's a deeper grounding worry underneath as well: symbolic goal alignment without contact with the world and social mediation can drift from real values no matter how it's optimized (see "Can AI systems achieve real alignment without world contact?").

So the satisfying twist: explicit reward isn't just unnecessary for common ground alignment — in the one head-to-head case here, reward-optimization is what *breaks* it, by teaching agents to satisfy a score instead of a partner. What you need instead is an external anchor (the partner) plus a structural constraint (causal consistency) that makes the agent answer to that anchor honestly. Reward is one way to inject a signal; structure and invariance are often a cleaner one — and sometimes the only one that doesn't get gamed.


Sources (7 notes)

Why do standard alignment methods ignore partner interventions?

Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can AI systems achieve real alignment without world contact?

Peircean semiotics reveals that symbolic goal encoding without world contact and social mediation cannot guarantee correspondence to actual values. LLMs operating in pure symbol manipulation risk divergence between stated goals and real-world outcomes.
