Can explicit connectives compensate for missing intentional tracking in LLMs?
This explores whether the explicit signposts in text — words like 'because,' 'therefore,' 'but' — can stand in for an LLM's ability to actually hold onto what a user wants across a conversation, given that models don't seem to maintain a real internal model of intent.
This reads the question as two things the corpus treats separately: explicit connectives (surface markers that spell out how ideas relate) and intentional tracking (keeping a user's goal in view across turns). The corpus says explicit connectives genuinely do compensate for some missing machinery, but not for intent tracking specifically, and the reason is the interesting part.
The strongest evidence that connectives carry real weight comes from the gap between causal and temporal reasoning Why do LLMs handle causal reasoning better than temporal reasoning?. Models handle 'A causes B' well because words like 'because' and 'so' appear explicitly and frequently in training text, while temporal order is usually left implicit and must be inferred. Same model, same task family — the only difference is whether the relationship was spelled out. So where a connective exists, the model leans on it instead of doing the inference itself. That's compensation in action.
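To make "spelled out on the surface" concrete, here is a toy sketch in Python. It is only an illustration, not the method behind the cited note: the word lists and the `surface_markers` helper are hypothetical, and real connective detection would need to handle ambiguity and multi-word markers.

```python
import re

# Hypothetical, illustrative marker lists (an assumption, not from the corpus).
CAUSAL_MARKERS = {"because", "so", "therefore", "thus"}
TEMPORAL_MARKERS = {"before", "after", "then", "later", "earlier"}


def surface_markers(text: str) -> dict:
    """Return the explicit relation markers that appear as whole words in text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "causal": sorted(words & CAUSAL_MARKERS),
        "temporal": sorted(words & TEMPORAL_MARKERS),
    }


# The causal relation is written on the surface.
marked = surface_markers("The road flooded because the river rose.")

# The temporal order here is carried only by sentence order, with no marker.
unmarked = surface_markers("She boiled the water. She made the tea.")
```

In the first sentence the relation can be read straight off a token; in the second, the ordering has to be inferred, which is the kind of gap the paragraph above describes.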
But intent tracking isn't a relationship between two clauses; it's a state that has to persist across many turns, and that's where the substitution breaks down. Models lock into premature assumptions early in underspecified conversations and never recover: a 39% average performance drop in multi-turn settings, of which agent mitigations recover only 15-20% Why do language models fail in gradually revealed conversations?. They also drift toward conversational distractors, not because they lack capacity but because nobody trained the 'what to ignore' signal Why do language models engage with conversational distractors?. A connective can clarify how this sentence relates to the last one; it can't reconstruct the goal the user established ten turns ago. The relevant 'marker' for intent is mostly absent from the text in the first place. Like temporal order, it's something the model would have to infer and hold, not read off the surface.
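As a rough worked example of how little the mitigations buy back, the sketch below plugs the figures from the source notes (a 39% drop, 15-20% of the loss recovered) into a hypothetical baseline. It assumes the 39% is a relative drop from the single-turn score; the baseline of 100 and the function name `multi_turn_score` are made up for illustration.

```python
def multi_turn_score(single_turn: float, drop: float = 0.39, recovered: float = 0.0) -> float:
    """Score after a relative drop, with a fraction of the lost amount recovered."""
    loss = single_turn * drop              # amount lost in the multi-turn setting
    return single_turn - loss + recovered * loss


baseline = 100.0                           # hypothetical single-turn score
no_fix = multi_turn_score(baseline)                      # about 61
low = multi_turn_score(baseline, recovered=0.15)         # about 66.85
high = multi_turn_score(baseline, recovered=0.20)        # about 68.8
```

Under these assumptions, even the better mitigation leaves the model roughly 31 points below its single-turn score.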
There's a deeper reason connectives can only paper over part of this. LLMs reason through semantic association rather than symbolic manipulation: give them correct rules but strip the familiar semantics, and performance collapses Do large language models reason symbolically or semantically?. An explicit connective is exactly the kind of high-frequency token pattern that triggers the right association, which is why it helps; but intent tracking would require maintaining a structured representation the model doesn't build. You can see the same disconnect in 'Potemkin understanding,' where explanation and execution run on functionally separate pathways Can LLMs understand concepts they cannot apply?. A marker can prompt the right words without driving the right behavior.
Worth noticing: some intent failures aren't even tracking failures, so no connective could fix them. Models that demonstrably know a user's claim is false often still won't correct it, favoring the social harmony they learned from training data over accuracy, with rejection rates that vary widely between models Why do language models avoid correcting false user claims?, Why do language models agree with false claims they know are wrong?. The upshot across the corpus: explicit connectives are a real and cheap crutch for inferential gaps that have a surface signal, but intent is mostly unsignaled and stateful. So the honest answer is that connectives help at the margins, and the actual fix is a training signal, not a vocabulary one.
Sources (7 notes)
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.