Why do longer forecasting horizons degrade LLM accuracy in role-play?
This reads the question as asking why an LLM playing a character (or simulating a user) gets less accurate the further out it has to project — more turns ahead, more steps into a scenario — and what mechanism makes that decay compound rather than stay flat.
The horizon can be more dialogue turns, more steps into a forecast, or a longer-running persona; in each case the short version the corpus keeps returning to is the same: error doesn't stay constant across a horizon, it compounds, and role-play has no built-in mechanism to wash that error out. Each step inherits the last step's mistakes and adds its own, so the longer the horizon, the more time an early wrong commitment has to spread.
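As a rough illustration of the compounding claim above, the toy model below assumes that each step is independently correct with the same probability and that an error, once made, is never repaired. Both assumptions are simplifications introduced here, not something the cited notes measure, and `horizon_accuracy` is a hypothetical name.

```python
def horizon_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` steps is correct.

    Toy assumption (not from the cited notes): steps fail independently and
    errors are never corrected, so a trajectory is right only if all of its
    steps are right.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Compare short and long horizons at the same per-step accuracy.
    for steps in (1, 5, 10, 20):
        print(steps, round(horizon_accuracy(0.98, steps), 3))
```

Under these assumptions a 2% per-step error rate already leaves about 0.67 trajectory-level accuracy after 20 steps. Real conversations are messier because errors are correlated rather than independent, which is the lock-in effect discussed in the next paragraph.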
The clearest picture of the accumulation comes from work on persona drift, which separates local drift within a single turn from global drift that builds across a whole conversation (see "Can training user simulators reduce persona drift in dialogue?"). That distinction matters for your question: global drift is precisely the horizon effect. It is small per turn and nearly invisible early, but it accumulates. The 'wrong turn' research sharpens why it can't self-correct: models lock into early guesses when information arrives gradually and then fail to revise them, dropping from 90% accuracy on a single-shot instruction to 65% across a natural multi-turn exchange (see "Why do AI assistants get worse at longer conversations?"). A long forecasting horizon is just a long chain of these early commitments, each one constraining the next.
There's a second, less obvious reason the corpus surfaces: role-play is character-consistent text production, not genuine state-tracking (see "Should we treat dialogue agents as role-playing characters?"). The model is generating continuations that *look* like the character, not running a persistent internal model of where that character actually is. So when a simulator has to hold a goal across many turns, the goal quietly decays unless something forces it to stay tracked. That is exactly the gap the goal-state-tracking work tries to close, by decomposing a user's goal into sub-components (profile, policy, task, requirements, preferences), each with its own explicit status (see "Why do LLM user simulators fail to track their own goals?"). Horizon degrades accuracy because nothing in the base setup pays the cost of remembering, so the further out you go, the more has been forgotten.
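To make the idea of explicitly tracked sub-components concrete, here is a minimal sketch of a goal-state record that a harness could keep outside the model. Only the five category names come from the text; the `Status` values, the `GoalState` class, and the `reminder` helper are hypothetical and are not the cited framework's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Illustrative statuses; the cited work may use a different set.
    PENDING = "pending"
    SATISFIED = "satisfied"
    VIOLATED = "violated"


# The five sub-components named in the text.
CATEGORIES = ("profile", "policy", "task", "requirements", "preferences")


@dataclass
class GoalState:
    # One explicit status per sub-component, so nothing is tracked implicitly.
    status: dict[str, Status] = field(
        default_factory=lambda: {c: Status.PENDING for c in CATEGORIES}
    )

    def update(self, category: str, status: Status) -> None:
        """Record the latest status for one sub-component."""
        if category not in self.status:
            raise KeyError(f"unknown sub-component: {category}")
        self.status[category] = status

    def unresolved(self) -> list[str]:
        """Sub-components not yet satisfied, in the fixed category order."""
        return [c for c in CATEGORIES if self.status[c] is not Status.SATISFIED]

    def reminder(self) -> str:
        """Text a harness could prepend to the simulator's next turn."""
        open_items = self.unresolved()
        if not open_items:
            return "All goal components are satisfied."
        return "Still open: " + ", ".join(open_items)
```

The design point is that the record lives outside the model's context window, so a goal component cannot silently drop out of view at turn 30 just because it was last mentioned at turn 3.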
The forecasting research adds the punchline by showing that the fix is architectural, not a matter of a bigger model: LLMs forecast far better than people think, but only when the workflow separates numerical reasoning from contextual reasoning. Monolithic prompting hides the capability that structured decomposition reveals (see "Can LLMs actually forecast time series better than we think?"). Read this alongside the broader finding that reliable agents externalize memory, skills, and protocols into a harness layer rather than expecting the model to re-solve them at each step (see "Where does agent reliability actually come from?"), and a pattern emerges: long horizons degrade accuracy when the model is asked to hold everything in its head at once, and degrade it less steeply when the horizon is broken into externally tracked pieces.
One thing worth knowing that you didn't ask for: the omniscient-simulation work exposes a failure mode that is invisible in most benchmarks. When one model controls all the characters, it skips the grounding work real role-play requires, and the cracks only show up under information asymmetry, where one agent has private knowledge the others don't (see "Why do LLMs fail when simulating agents with private information?"). Long horizons make this worse because they give the agents' differing knowledge more time to diverge. So part of why role-play decays over distance isn't just memory; the model was never doing the bookkeeping a genuinely separate, partially informed agent would have to do.
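One way to picture the bookkeeping described above is to build each agent's context from the public history plus only that agent's own private facts, and to compare it with the omniscient view a single controlling model effectively has. This is a sketch under stated assumptions: `Agent`, `build_context`, and the `omniscient` flag are hypothetical names, and the cited work may set up its experiments differently.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    private_facts: list[str] = field(default_factory=list)


def build_context(
    agent: Agent,
    public_history: list[str],
    all_agents: list[Agent],
    omniscient: bool = False,
) -> list[str]:
    """Return the lines one agent is allowed to condition on.

    omniscient=True mimics a single model that sees every agent's private
    facts; omniscient=False restricts the agent to its own.
    """
    context = list(public_history)  # copy so the shared history is not mutated
    sources = all_agents if omniscient else [agent]
    for source in sources:
        context.extend(f"[{source.name} knows] {fact}" for fact in source.private_facts)
    return context


if __name__ == "__main__":
    buyer = Agent("buyer", ["budget is 500"])
    seller = Agent("seller", ["cost is 300"])
    history = ["buyer: hello"]
    print(build_context(buyer, history, [buyer, seller]))
    print(build_context(buyer, history, [buyer, seller], omniscient=True))
```

In the partitioned case the buyer never sees the seller's cost, so anything it says about that cost has to be earned through the dialogue itself, which is the grounding work an omniscient setup lets the model skip.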
Sources (7 notes)
"Can training user simulators reduce persona drift in dialogue?" Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
"Why do AI assistants get worse at longer conversations?" LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
"Should we treat dialogue agents as role-playing characters?" Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk psychology applicable to the simulated persona, not the underlying system.
"Why do LLM user simulators fail to track their own goals?" The UGST framework breaks user goals into profile, policy, task, requirements, and preferences, each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.
"Can LLMs actually forecast time series better than we think?" LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.
"Where does agent reliability actually come from?" Research shows reliable LLM agents externalize three cognitive burdens, namely memory (state persistence), skills (procedural components), and protocols (structured interaction), into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and removes the need for the model to solve the same problems repeatedly.
"Why do LLMs fail when simulating agents with private information?" Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.