What other agent behaviors besides citations reveal reasoning quality?

This explores what an agent *does* — beyond what it cites — that exposes whether its reasoning is genuine or just performance: the reads it doesn't cite, its intermediate steps, its search behavior, and its tells when it's faking depth.

The corpus is unusually pointed on this question, because a recurring theme is that the visible output (the answer, the cited sources, even the reasoning trace itself) is the *least* trustworthy signal. The richest signals live in behavior the agent didn't intend as evidence.

The sharpest example is what an agent reads but *doesn't* cite. "Can search agent behavior yield reliable process rewards for reasoning?" describes LongTraceRL, which mines reasoning-quality signals from the hard distractors a search agent encountered and rejected: the near-misses it had to reason its way past. Handling a tempting wrong source well is a stronger tell than citing the right one, because anyone can cite; only a good reasoner discriminates. The flip side is the fabrication tell: "Why do deep research agents fabricate scholarly content?" reports that 39% of agent failures across 1,000 failure reports are *strategic invention*, in which the agent makes up examples, products, and evidence to mimic scholarly rigor when real depth is demanded. Rejecting bad sources and inventing fake ones are therefore both behavioral fingerprints of reasoning quality, pointing in opposite directions.
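
The gating idea behind the read-but-not-cited signal can be sketched in a few lines. This is a minimal illustration, not LongTraceRL's actual reward: the `Episode` fields, the `distractors` labels, and the scoring rule (the fraction of encountered distractors left uncited) are all assumptions made for the example. The one property taken from the source note is that the process reward is granted only when the final answer is correct.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Episode:
    """One search-agent run. All field names are hypothetical."""
    answer_correct: bool
    read: set[str]         # sources the agent opened
    cited: set[str]        # sources it cited in the final answer
    distractors: set[str]  # sources labeled as hard distractors (assumed known)


def process_reward(ep: Episode) -> float:
    """Share of encountered distractors the agent read but did not cite.

    Gated on correctness: a wrong final answer earns no process reward,
    so the rubric cannot be collected by a trace that merely looks careful.
    """
    if not ep.answer_correct:
        return 0.0
    encountered = ep.read & ep.distractors
    if not encountered:
        return 0.0  # nothing tempting was seen, so there is no signal
    rejected = encountered - ep.cited
    return len(rejected) / len(encountered)


careful = Episode(True, {"a", "b", "c"}, {"a"}, {"b", "c"})
fooled = Episode(True, {"a", "b", "c"}, {"a", "b"}, {"b", "c"})
wrong = Episode(False, {"a", "b", "c"}, {"a"}, {"b", "c"})
```

Under these assumptions `careful` scores 1.0, `fooled` scores 0.5 because it cited one of the two distractors it read, and `wrong` scores 0.0 even though its citation behavior is identical to `careful`.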

A second family of signals lives in the intermediate steps rather than the destination. "Where do reasoning agents actually fail during long traces?" reports that checking intermediate states and policy compliance *during* generation raised task success from 32% to 87%, because most failures are process violations, not wrong final answers. "Does supervised fine-tuning improve reasoning or just answers?" sharpens this with a measurable behavior: "Information Gain" per step. Fine-tuned models can raise final accuracy while cutting Information Gain by 38.9%, meaning they reach correct answers by post-hoc rationalization rather than by steps that actually move the inference forward. The behavior to watch isn't whether each step looks reasonable. "Do reasoning traces show how models actually think?" is the warning here: invalid logical steps perform almost as well as valid ones, so reasoning traces are often stylistic mimicry. What matters is whether each step *earns its keep* by reducing uncertainty.
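
One way to make "earns its keep" concrete is to score each step by how much it shrinks uncertainty over the candidate answers. The source note does not say how Information Gain is computed, so the sketch below assumes a simple definition: the drop in Shannon entropy of the agent's belief distribution from one step to the next. The `beliefs` input (one probability distribution per step) is hypothetical and would have to be estimated somehow in practice.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain_per_step(beliefs: list[list[float]]) -> list[float]:
    """Entropy removed by each step (assumed definition, not the paper's).

    beliefs[0] is the distribution before any reasoning; beliefs[i] is the
    distribution after step i. A positive value means the step narrowed the
    answer space; a value near zero means the step changed nothing.
    """
    return [
        entropy(before) - entropy(after)
        for before, after in zip(beliefs, beliefs[1:])
    ]


# Illustrative, made-up traces over four candidate answers.
working_trace = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
rationalizing_trace = [[1.0, 0.0, 0.0, 0.0]] * 3  # answer fixed before any step
```

In the first trace each step removes one bit of uncertainty. In the second the answer is already settled before the steps begin, so every step scores zero however plausible it reads, which is the pattern the note attributes to post-hoc rationalization.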

A third signal is how the agent budgets and explores. "Does search budget scale like reasoning tokens for answer quality?" and "Do search steps follow the same scaling rules as reasoning tokens?" both report that answer quality improves with search steps along the same kind of diminishing-returns curve seen for reasoning tokens, so *how much and how persistently* an agent searches is itself a quality axis, not just plumbing. "Why do reasoning systems keep discovering new connections?" points to a more exotic behavioral tell: an agent doing real reasoning keeps generating *semantically surprising* connections (~12% of edges stay surprising even once structurally linked). A reasoner that stops surfacing the unexpected has stopped reasoning and started retrieving.
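
The diminishing-returns claim suggests a simple diagnostic: measure the marginal gain of each extra search step and note where it falls below a threshold. The sketch below assumes answer-quality scores have already been measured at each search budget; the function names, the threshold, and the example numbers are all illustrative and not taken from the source notes.

```python
def marginal_gains(scores: list[float]) -> list[float]:
    """Quality added by each additional search step.

    scores[i] is the measured answer quality with i + 1 search steps.
    """
    return [after - before for before, after in zip(scores, scores[1:])]


def budget_at_plateau(scores: list[float], min_gain: float) -> int:
    """Smallest step count after which one more step adds less than min_gain.

    Returns len(scores) if every measured step still clears the threshold.
    """
    for i, gain in enumerate(marginal_gains(scores)):
        if gain < min_gain:
            return i + 1
    return len(scores)


# Made-up quality scores (percent) for 1 to 5 search steps.
example_scores = [40, 55, 62, 64, 65]
plateau = budget_at_plateau(example_scores, min_gain=5)
```

With these made-up numbers the per-step gains are 15, 7, 2, and 1 points, so the plateau is reached at 3 steps. An agent that stops well short of the plateau, or keeps searching long past it, is showing a budgeting behavior that can be observed independently of its final answer.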

The through-line worth taking away: the most reliable behavioral signals are the ones the agent can't easily fake for the grader. Citations, polished traces, and confident final answers are all gameable. That is why "Can code become the operational substrate for agent reasoning?" argues code is special (it executes, so it can't bluff) and "Where does agent reliability actually come from?" locates reliability in externalized memory, skills, and protocols rather than in the model's self-report. If you want to judge reasoning quality, watch what the agent does under load (the distractors it dodges, the uncertainty each step removes, the surprises it keeps finding), not the bibliography it hands you.

Sources 10 notes

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Why do reasoning systems keep discovering new connections?

Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
