INQUIRING LINE

Why do longer forecasting horizons degrade LLM accuracy in role-play?

This reads the question as asking why an LLM playing a character (or simulating a user) gets less accurate the further out it has to project — more turns ahead, more steps into a scenario — and what mechanism makes that decay compound rather than stay flat.


This explores why role-playing LLMs lose accuracy as the horizon stretches, whether that horizon is more dialogue turns, more steps into a forecast, or a longer-running persona. The short version the corpus keeps returning to: error doesn't stay constant across a horizon; it compounds, and role-play has no built-in mechanism to wash that error out. Each step inherits the last step's mistakes and adds its own. The longer the horizon, the more time an early wrong commitment has to metastasize.
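A toy calculation makes the compounding concrete. This is a minimal sketch, not something taken from the sources: it assumes every step is independently correct with the same probability and that a wrong step is never repaired, which is a deliberately pessimistic simplification. The function name `chain_accuracy` and the 0.98 per-step figure are illustrative assumptions, not measured values.

```python
def chain_accuracy(per_step_accuracy: float, horizon: int) -> float:
    """Probability that every step in a chain of `horizon` steps is correct.

    Assumption (hypothetical): steps succeed independently with the same
    probability, and nothing downstream ever corrects an earlier error.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return per_step_accuracy ** horizon


if __name__ == "__main__":
    # Illustrative per-step accuracy; not a figure from any cited work.
    for horizon in (1, 5, 10, 20):
        print(horizon, round(chain_accuracy(0.98, horizon), 3))
```

Even with a per-step accuracy that looks excellent in isolation, the chance that the whole chain is clean shrinks with every added step. Real conversations are messier than this model (errors are correlated, and some do get repaired), so treat it as intuition for the shape of the decay rather than a prediction.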

The clearest picture of the accumulation comes from work on persona drift, which distinguishes local drift within a single turn from global drift that builds across a whole conversation (see "Can training user simulators reduce persona drift in dialogue?"). That distinction matters for your question, because global drift is precisely the horizon effect. It's small per turn and nearly invisible early, but it integrates. The 'wrong turn' research sharpens why it can't self-correct: models lock into early guesses when information arrives gradually and then can't revise them, dropping from 90% accuracy on a single-message instruction to 65% across a natural multi-turn exchange (see "Why do AI assistants get worse at longer conversations?"). A long forecasting horizon is just a long chain of these early commitments, each one constraining the next.

There's a second, less obvious reason the corpus surfaces: role-play is character-consistent text production, not genuine state-tracking (see "Should we treat dialogue agents as role-playing characters?"). The model is generating continuations that *look* like the character, not running a persistent internal model of where that character actually is. So when a simulator has to hold a goal across many turns, the goal quietly decays unless something forces it to stay tracked. That is exactly the gap the goal-state-tracking work tries to close by decomposing a user's goal into independently monitored sub-components: profile, policy, task, requirements, and preferences (see "Why do LLM user simulators fail to track their own goals?"). Horizon degrades accuracy because nothing in the base setup is paying the cost of remembering, so the further out you go, the more has been forgotten.
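To show what "independently monitored sub-components" could look like in practice, here is a minimal sketch of an external goal tracker. Only the five sub-component names come from the source summary; the `Status` values, the `GoalState` class, and its methods are hypothetical names invented for illustration, and this is not the UGST implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status values; the source only says each
    # sub-component has "explicit status tracking".
    PENDING = "pending"
    SATISFIED = "satisfied"
    VIOLATED = "violated"


# The five sub-components named in the source summary.
SUB_COMPONENTS = ("profile", "policy", "task", "requirements", "preferences")


@dataclass
class GoalState:
    """Tracks each goal sub-component outside the model's context."""

    status: dict[str, Status] = field(
        default_factory=lambda: {name: Status.PENDING for name in SUB_COMPONENTS}
    )

    def update(self, name: str, new_status: Status) -> None:
        """Record the latest status of one sub-component."""
        if name not in self.status:
            raise KeyError(f"unknown sub-component: {name}")
        self.status[name] = new_status

    def unresolved(self) -> list[str]:
        """Sub-components that are not yet satisfied."""
        return [n for n, s in self.status.items() if s is not Status.SATISFIED]

    def reminder(self) -> str:
        """Text that could be re-injected into the prompt every turn."""
        open_items = self.unresolved()
        return "Still open: " + ", ".join(open_items) if open_items else "All goals met."


if __name__ == "__main__":
    state = GoalState()
    state.update("task", Status.SATISFIED)
    print(state.reminder())
```

The point of the design is that the tracker, not the model, pays the cost of remembering: the status table survives every turn unchanged unless something explicitly updates it.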

The forecasting research adds the punchline by showing the fix is architectural, not a matter of a bigger model: LLMs forecast far better than people think, but only when the workflow separates numerical reasoning from contextual reasoning. Monolithic prompting hides the capability that structured decomposition reveals (see "Can LLMs actually forecast time series better than we think?"). Read this alongside the broader finding that reliable agents externalize memory, skills, and protocols into a harness layer rather than expecting the model to re-solve them each step (see "Where does agent reliability actually come from?"), and a pattern emerges: long horizons degrade accuracy when the model is asked to hold everything in its head at once, and stop degrading as steeply when the horizon is broken into externally tracked pieces.
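The separation described above can be sketched as a two-stage workflow: deterministic code produces the numbers, and a separate step applies context. This is an assumption-laden illustration, not the method of any cited paper. `numeric_baseline`, `forecast`, and `promo_bump` are hypothetical names, the baseline is a deliberately naive trend extrapolation, and `promo_bump` stands in for whatever contextual reasoning (for example, a model call) a real workflow would use.

```python
from typing import Callable, Sequence


def numeric_baseline(history: Sequence[float], steps: int) -> list[float]:
    """Numerical stage: extrapolate the average step-to-step change.

    A naive stand-in for a real numerical forecaster (assumption).
    """
    if len(history) < 2:
        raise ValueError("need at least two observations")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    drift = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + drift * (i + 1) for i in range(steps)]


def forecast(
    history: Sequence[float],
    steps: int,
    adjust: Callable[[list[float]], list[float]],
) -> list[float]:
    """Run the numerical stage first, then hand the result to the contextual stage."""
    baseline = numeric_baseline(history, steps)
    adjusted = adjust(baseline)
    if len(adjusted) != steps:
        raise ValueError("contextual stage must return one value per step")
    return adjusted


def promo_bump(baseline: list[float]) -> list[float]:
    """Hypothetical contextual rule: a known promotion lifts every value by 50%."""
    return [value * 1.5 for value in baseline]


if __name__ == "__main__":
    print(forecast([1.0, 2.0, 3.0], steps=2, adjust=promo_bump))
```

Because the contextual stage only ever sees a finished baseline, an arithmetic slip cannot hide inside a paragraph of reasoning, and each stage can be checked on its own.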

One thing worth knowing that you didn't ask about: the omniscient-simulation work exposes a failure mode that's invisible in most benchmarks. When one model controls all the characters, it skips the grounding work real role-play requires, and the cracks only show up under information asymmetry, where one agent has private knowledge the others don't (see "Why do LLMs fail when simulating agents with private information?"). Long horizons make this worse because they give asymmetric information more time to diverge. So part of why role-play decays over distance isn't just memory: the model was never doing the bookkeeping a genuinely separate, partially informed agent would have to do.
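The bookkeeping in question can be made explicit with a small sketch that contrasts what a partially informed agent may condition on with what an omniscient simulator sees. All names here (`Agent`, `visible_context`, `omniscient_context`) and the negotiation example are hypothetical, chosen only to illustrate the asymmetry; none of it comes from the cited work.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A role-play participant with knowledge the others cannot see."""

    name: str
    private_facts: set[str] = field(default_factory=set)

    def visible_context(self, transcript: list[str]) -> list[str]:
        """Public transcript plus this agent's own private facts, nothing else."""
        return list(transcript) + sorted(self.private_facts)


def omniscient_context(agents: list[Agent], transcript: list[str]) -> list[str]:
    """What a single model controlling every character would condition on."""
    all_facts: set[str] = set().union(*(a.private_facts for a in agents))
    return list(transcript) + sorted(all_facts)


if __name__ == "__main__":
    buyer = Agent("buyer", {"max budget is 500"})
    seller = Agent("seller", {"floor price is 300"})
    transcript = ["seller: I can offer 450"]

    print(buyer.visible_context(transcript))
    print(omniscient_context([buyer, seller], transcript))
```

In the omniscient view the buyer's lines can silently lean on the seller's floor price. Keeping a per-agent context forces every turn to be generated from what that character could actually know, and the gap between the two views widens as more private facts accumulate over a long exchange.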


Sources (7 notes)

Can training user simulators reduce persona drift in dialogue?

Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Why do LLM user simulators fail to track their own goals?

The UGST framework breaks user goals into profile, policy, task, requirements, and preferences—each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether longer forecasting horizons still degrade LLM accuracy in role-play, treating a curated library's findings (2023–2026) as dated claims, not current truth.

What a curated library found — and when (dated claims, not current truth):
The library spans 2023–2026. Its core claims:
• Persona drift compounds across turns (local drift within a turn, global drift across a conversation), and models lock into early guesses: accuracy falls from ~90% on single-message instructions to ~65% in multi-turn conversation (2025).
• Role-play is text-generation mimicry, not state-tracking; goals decay unless externally tracked via decomposed sub-components (profile, policy, task, requirements, preferences) (2025).
• LLM forecasting capability is stronger than recognized, but only when numerical reasoning is separated from contextual reasoning; monolithic prompting hides it (2026).
• Agent reliability comes from externalizing memory, skills, and protocols into a harness layer, not embedding them in the model (2026).
• Omniscient simulation (one model controlling all characters) looks competent but masks failures that appear under information asymmetry, when agents hold private information; long horizons amplify divergence (2024).

Anchor papers (verify; mind their dates):
• arXiv:2511.00222 (Oct 2025) — Multi-turn RL for persona consistency
• arXiv:2507.20152 (Jul 2025) — Goal alignment in LLM user simulators
• arXiv:2604.08224 (Apr 2026) — Externalization in LLM agents
• arXiv:2605.14389 (May 2026) — Nexus framework for time-series forecasting

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, determine whether newer models (o1, Gemini 2.0, Claude 4), structured reasoning frameworks (agentic orchestration, memory systems, tool use), or evals have relaxed or overturned the 65% / decay pattern. Separate the durable question ("Do long horizons inherently compound error in role-play?") from the perishable limitation ("Monolithic prompting hides forecasting capability"). Flag which constraints still hold and which may have dissolved.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper showing role-play accuracy *stays stable* across horizons, or that externalization is unnecessary, or that a single model can handle asymmetric information.
(3) Propose 2 research questions that assume the regime *has* shifted: (a) If externalization + structured decomposition now solve horizon decay, what is the NEW failure mode at 100+ turn horizons? (b) If LLM forecasting works well when separated numerically, does that mean role-play horizon degradation is a *problem formulation* issue, not a model limitation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
