What distinguishes first-order from second-order agency in language models?

Here, 'first-order' agency means a model acting directly in the moment, executing the task in front of it. 'Second-order' agency means a model acting on a model of intentions: tracking goals across turns, monitoring its own reasoning, and discovering what is actually wanted rather than answering only what was literally asked.

At issue is the gap between an LLM doing the next thing (first-order agency) and an LLM steering toward a goal it has to infer and hold across time (second-order agency). The corpus doesn't use these exact labels, but it maps the territory sharply, and the consistent finding is that models are far stronger at the first than the second. The clearest demonstration is in multi-turn conversation: "Why do language models respond passively instead of asking clarifying questions?" shows that standard RLHF rewards immediate helpfulness, which actively trains models *out* of second-order behavior. They answer rather than ask clarifying questions, because the next-turn reward signal never credits the long game. Second-order agency requires valuing an interaction's eventual outcome over the current reply, and most models simply aren't optimized to do that.

What happens when that capacity is missing is visible in "Why do language models fail in gradually revealed conversations?": across 200,000+ conversations, models lock onto an early guess about user intent and can't recover, producing a 39% average performance drop. A first-order agent commits to its best immediate read; a second-order agent would hold uncertainty open and revise. The inability to revise an inferred goal is precisely the second-order failure mode, and agent-style mitigations recover only 15-20% of the loss, which suggests the problem is architectural rather than a prompting gap.

The corpus also warns that apparent second-order agency is often first-order behavior in disguise. "Are models actually reasoning about constraints or just defaulting conservatively?" finds that twelve of fourteen models do *worse* when constraints are removed: they look like they're reasoning about a goal, but they're really just defaulting to the harder, more conservative option. Similarly, "Do large language models actually commit to a single character?" shows there's no stable 'self' doing the steering: regenerate a response and you get a different character sampled from a superposition, each locally consistent but none committed. It's hard to have durable second-order agency, meaning goals that persist over time, when the agent itself is resampled at every generation.

There's a deeper architectural undercurrent here too. "Do transformers hide reasoning before producing filler tokens?" reveals that models trained with hidden chain-of-thought tokens can compute an answer in early layers and then suppress it in the final layers to satisfy the output format, a striking dissociation between internal computation and external action that complicates any clean story about what the agent is 'trying' to do. And "Can prompt optimization teach models knowledge they lack?" sets a hard ceiling: you can reorganize what a model already has, but you can't prompt second-order capacity into existence if the training never built it.

The practical upshot worth carrying away: most of what we call 'agentic' work is first-order, and "Can small language models handle most agent tasks?" argues that small models handle that repetitive, well-defined layer at a fraction of the cost. The expensive, still-unsolved part is the second-order layer: sustaining intent, asking the right question, revising a goal mid-stream. That's the frontier, and the corpus suggests it's blocked less by scale than by what our reward signals and architectures actually train for.

Sources (7 notes)

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
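
The contrast between a next-turn reward and a multi-turn-aware reward can be illustrated with a toy sketch. Everything here is hypothetical and chosen only for illustration: the `Candidate` type, the two scores attached to each reply, and the blending weight are assumptions, not CollabLLM's actual method or numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    immediate_helpfulness: float   # hypothetical next-turn reward
    expected_final_outcome: float  # hypothetical estimate of eventual success


def next_turn_choice(candidates):
    """Pick the reply that looks most helpful right now."""
    return max(candidates, key=lambda c: c.immediate_helpfulness)


def multi_turn_choice(candidates, weight=0.8):
    """Pick the reply with the best blend of immediate and long-term value.

    `weight` is an assumed knob for how much the long-term estimate counts.
    """
    def score(c):
        return ((1 - weight) * c.immediate_helpfulness
                + weight * c.expected_final_outcome)
    return max(candidates, key=score)


candidates = [
    Candidate("Here is a full draft based on my best guess.", 0.9, 0.4),
    Candidate("Before I draft this, who is the audience?", 0.3, 0.8),
]

print("next-turn :", next_turn_choice(candidates).text)
print("multi-turn:", multi_turn_choice(candidates).text)
```

Under these made-up scores, the next-turn selector answers immediately while the multi-turn selector asks the clarifying question. The point is only that the choice flips once the eventual outcome enters the objective.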

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
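
To make the scale of these figures concrete, here is a small worked calculation. It assumes a single-turn baseline normalized to 100 and reads the 39% drop as relative to that baseline; the note does not state either, so treat both as assumptions.

```python
# Assumptions (not from the source): baseline normalized to 100,
# and the 39% drop interpreted relative to that baseline.
baseline = 100.0
drop_fraction = 0.39
recovery_range = (0.15, 0.20)  # share of the loss that mitigations recover

multi_turn = baseline * (1 - drop_fraction)  # score after the drop
loss = baseline - multi_turn                 # points lost to multi-turn


def mitigated_score(recovered_fraction):
    """Score after recovering a given share of the lost points."""
    return multi_turn + loss * recovered_fraction


low, high = (mitigated_score(r) for r in recovery_range)
print(f"multi-turn: {multi_turn:.2f}")
print(f"with mitigations: {low:.2f} to {high:.2f}")
```

Under these assumptions the mitigated score lands at roughly 67 to 69, still more than 30 points below the single-turn baseline.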

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
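
A toy sketch of the superposition idea, assuming a tiny hand-written object table (hypothetical, for illustration only). No object is chosen up front; each "regeneration" samples one of the objects still consistent with the answers given so far. This illustrates the concept and is not a reconstruction of Shanahan's experiment.

```python
import random

# Hypothetical candidate objects with yes/no attributes.
OBJECTS = {
    "cat": {"alive": True, "bigger_than_breadbox": False},
    "sparrow": {"alive": True, "bigger_than_breadbox": False},
    "horse": {"alive": True, "bigger_than_breadbox": True},
    "teapot": {"alive": False, "bigger_than_breadbox": False},
}


def consistent_objects(answers):
    """Return every object that agrees with all answers given so far."""
    return [
        name for name, attrs in OBJECTS.items()
        if all(attrs[question] == answer for question, answer in answers.items())
    ]


def reveal(answers, rng):
    """Sample one consistent object at reveal time; nothing was fixed earlier."""
    return rng.choice(consistent_objects(answers))


answers_so_far = {"alive": True, "bigger_than_breadbox": False}

# Each seed stands in for one regeneration of the same final turn.
reveals = [reveal(answers_so_far, random.Random(seed)) for seed in range(20)]
print(sorted(set(reveals)))
```

Every reveal is consistent with the prior answers, yet different regenerations can name different objects, which is the sense in which no single commitment exists.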

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
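
A back-of-the-envelope sketch of why the heterogeneous pattern is attractive. The 10-30× cost ratio comes from the note above; the 80% routing share and the function name `blended_cost` are hypothetical assumptions for illustration.

```python
def blended_cost(slm_fraction, cost_ratio, llm_cost=1.0):
    """Average per-task cost when `slm_fraction` of tasks go to an SLM.

    `cost_ratio` is how many times cheaper the SLM is than the LLM.
    The result is in the same units as `llm_cost`.
    """
    if not 0.0 <= slm_fraction <= 1.0:
        raise ValueError("slm_fraction must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    slm_cost = llm_cost / cost_ratio
    return slm_fraction * slm_cost + (1 - slm_fraction) * llm_cost


# Assumed routing share: 80% of tasks are simple enough for the SLM.
for ratio in (10, 30):
    cost = blended_cost(0.8, ratio)
    print(f"SLM {ratio}x cheaper -> blended cost {cost:.3f} of all-LLM cost")
```

With these assumed inputs the blended cost falls to roughly a quarter of the all-LLM cost, and the remaining spend is dominated by the tasks still routed to the LLM.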
