What training architecture models the causal structure of partner influence?

This explores which training method actually teaches an AI to reason about how a partner's suggestions causally affect an outcome — rather than just mimicking agreeable-sounding behavior.

The corpus's sharpest answer is counterfactual invariance training. Instead of rewarding an agent for sounding cooperative, you regularize it to stay consistent when the partner's intervention pathway is nullified, which forces the agent to weigh a suggestion by its actual causal impact rather than its surface plausibility. Strikingly, 'common ground' alignment falls out as a byproduct, with no explicit reward for it required (see: Why do standard alignment methods ignore partner interventions?).
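The regularizer described above can be sketched as a consistency penalty. This is a minimal illustration under stated assumptions, not the method from the cited note: the KL divergence as the consistency measure, the `weight` coefficient, and the two toy policies are all hypothetical choices made for the example. The penalty is meant to be applied only to episodes where the suggestion's causal pathway to the outcome has been nullified.

```python
import math

def softmax(logits):
    """Convert raw action scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def invariance_penalty(policy, observation, suggestion, null_suggestion=None):
    """Penalty for an episode whose intervention pathway is nullified.

    The suggestion cannot affect the outcome here, so the agent's action
    distribution should match the one it produces with no suggestion.
    """
    with_suggestion = softmax(policy(observation, suggestion))
    without_suggestion = softmax(policy(observation, null_suggestion))
    return kl_divergence(with_suggestion, without_suggestion)

def total_loss(task_loss, penalty, weight=1.0):
    """Task objective plus the weighted invariance regularizer."""
    return task_loss + weight * penalty

# Hypothetical toy policies: the observation is a list of base action logits,
# and a suggestion is the index of the action the partner recommends.
def swayed_policy(observation, suggestion):
    """Boosts whatever the partner suggests, regardless of causal effect."""
    logits = list(observation)
    if suggestion is not None:
        logits[suggestion] += 2.0
    return logits

def grounded_policy(observation, suggestion):
    """Ignores a suggestion that has no pathway to the outcome."""
    return list(observation)

obs = [0.1, 0.2, 0.3]
swayed = invariance_penalty(swayed_policy, obs, suggestion=0)
grounded = invariance_penalty(grounded_policy, obs, suggestion=0)
```

On this toy input the swayed policy incurs a positive penalty and the grounded policy incurs none, which is the pressure that pushes an agent to respond to a suggestion only when it can actually change the outcome.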

The reason this architecture matters becomes clear when you see what the default methods do. Standard RLHF and DPO optimize for confident, single-turn helpfulness, and that same objective quietly erodes the very acts that make a partnership work. One study measures grounding behaviors (clarifying questions, understanding checks) at 77.5% below human levels, an 'alignment tax' where the model looks helpful but stops actually tracking its partner (see: Does preference optimization harm conversational understanding?). So modeling partner influence isn't an add-on; it repairs something preference optimization actively breaks.

The causal framing also connects to a separate line of work on extracting causal belief networks from interview transcripts and running do-calculus interventions on them. This is a way of structurally auditing how a mind updates under a hypothetical change, instead of trusting opaque persona prompting (see: Can we extract causal belief networks from interview conversations?). That's the same move as counterfactual invariance, applied to belief change rather than agent behavior: both ask 'what happens when I intervene on this pathway?' But the corpus also flags the ceiling. Causal models capture only part of how people reason, missing associative, analogical, and emotion-driven shifts, so any partner-influence architecture built purely on causal structure is a tractable starting point, not the whole picture (see: Can causal models alone capture how humans actually reason?).
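To make 'intervening on a pathway' concrete, here is a minimal sketch of a do-style intervention on a small belief graph. It assumes a linear structural model with hand-picked weights; the node names, the weights, and the `propagate` helper are hypothetical and are not taken from the pipeline in the cited note. The intervention is implemented as graph surgery: the intervened node is clamped to a value and its incoming edges are ignored.

```python
from typing import Dict, List, Optional, Tuple

# Each edge is (cause, effect, weight) in a linear belief model.
Edge = Tuple[str, str, float]

def propagate(
    order: List[str],
    edges: List[Edge],
    exogenous: Dict[str, float],
    do: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Compute node values in topological order.

    Nodes listed in `do` are clamped to the given value and their parents
    are ignored, which is the graph-surgery reading of do(X = x).
    """
    do = do or {}
    values: Dict[str, float] = {}
    for node in order:
        if node in do:
            values[node] = do[node]
            continue
        total = exogenous.get(node, 0.0)
        for cause, effect, weight in edges:
            if effect == node:
                total += weight * values[cause]
        values[node] = total
    return values

# Hypothetical belief graph: trust shapes perceived fairness, and both
# shape support for a policy.
order = ["trust", "fairness", "support"]
edges = [
    ("trust", "fairness", 0.8),
    ("fairness", "support", 0.5),
    ("trust", "support", 0.25),
]
exogenous = {"trust": 1.0}

baseline = propagate(order, edges, exogenous)
intervened = propagate(order, edges, exogenous, do={"fairness": 0.0})

# Change in support attributable to the fairness pathway alone.
fairness_effect = baseline["support"] - intervened["support"]
```

Comparing `baseline` with `intervened` isolates how much of the final belief flows through the fairness pathway, which is the kind of structural audit the paragraph above describes. Real belief graphs need not be linear, so treat the arithmetic as illustrative only.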

Two adjacent findings make the territory richer. First, post-training shifts a model from passive prediction to recognizing its own outputs as actions that shape future inputs, closing an action-perception loop. That is arguably the precondition for an agent to even register that a partner can be influenced (see: Do models recognize their own outputs as actions shaping future inputs?). Second, on the human side, working alliance can be computationally inferred turn-by-turn from therapy transcripts, giving a measurable target for what a well-modeled partnership looks like (see: Can we measure therapist-patient alliance from dialogue turns in real time?).
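As a rough picture of turn-by-turn alliance scoring, the sketch below scores one dialogue turn against a set of inventory-item embeddings. Everything specific here is an assumption: the cited system's embedding model and scoring function are not described in this corpus, so cosine similarity, the two-dimensional toy vectors, and the function names are hypothetical stand-ins. A real setup would use one embedding per inventory item, giving one score per item for each turn.

```python
import math
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def alliance_scores(
    turn_embedding: List[float],
    item_embeddings: List[List[float]],
) -> List[float]:
    """One similarity score per inventory item for a single dialogue turn."""
    return [cosine(turn_embedding, item) for item in item_embeddings]

# Toy data: two hypothetical inventory items and one turn embedding.
items = [[1.0, 0.0], [0.0, 1.0]]
turn = [1.0, 0.0]
scores = alliance_scores(turn, items)
```

Running this per turn yields a score vector over time, which is the shape of signal a partner-modeling objective could be evaluated against.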

The thing you didn't know you wanted to know: humans, given repeated rounds, actually learn to *prefer* AI partners. Initially biased against disclosed bots, people came around once the AI proved reliably prosocial, with lower variance than humans (see: Do humans learn to prefer AI partners over time?). That raises the real stakes of getting partner-influence training right: an agent that models causal influence well isn't just more useful, it's one that people will choose over other people.

Sources (7 notes)

Why do standard alignment methods ignore partner interventions?

Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can we extract causal belief networks from interview conversations?

A three-step pipeline—extracting causal motifs from QA, composing belief graphs, and applying do-calculus interventions—successfully models how individuals update beliefs in response to hypothetical policy changes. The approach provides structural auditability that opaque persona prompting cannot.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can we measure therapist-patient alliance from dialogue turns in real time?

COMPASS maps dialogue turns onto Working Alliance Inventory (WAI) embeddings to produce 36-dimensional alliance scores per turn. Anxiety and depression show convergence in alliance metrics over time, while suicidality shows persistent misalignment between patient and therapist.

Do humans learn to prefer AI partners over time?

In partner selection games (N=975), AI agents initially faced selection bias when identity was disclosed, but outcompeted humans over repeated rounds as participants learned to associate bot identity with reliable, prosocial behavior. AI agents returned more points consistently with lower variance than humans.
