Does sentence-level granularity capture enough structure for complex reasoning tasks?
This explores whether reasoning that operates at the sentence level — like Meta's Large Concept Model, which thinks in whole-sentence embeddings rather than word-by-word tokens — preserves the fine structure that hard reasoning needs, or whether the bottleneck for complex reasoning lives somewhere else entirely.
The granularity question is real, but the corpus suggests it is probably the wrong place to look for the bottleneck. The strongest case *for* sentences comes from Meta's Large Concept Model, which reasons over sentence embeddings in a language-agnostic space and plans at the paragraph level before decoding to words (Can reasoning happen at the sentence level instead of tokens?). The bet there is that whole-sentence units are the right altitude for coherent planning. But a quiet counter-current runs through the corpus: when researchers look *inside* reasoning chains, the load-bearing structure turns out to be much finer than a sentence. Only about 20% of tokens, the high-entropy 'forking points', actually carry the learning signal in RLVR (Do high-entropy tokens drive reasoning model improvements?), and likelihood-preserving pruning shows that tokens can be ranked by function, with symbolic-computation tokens preferentially preserved while grammar and meta-discourse are pruned first (Which tokens in reasoning chains actually matter most?). If the decisive moves happen at a handful of pivotal tokens, a sentence-level representation risks smearing the signal across the very units that matter most.
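To make the 'forking points' idea concrete, here is a minimal sketch of how high-entropy positions might be picked out of a reasoning chain. It assumes you already have a next-token probability distribution for each position. The function names, the toy distributions, and the top-fraction selection rule are illustrative assumptions, not the procedure used in the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions, in order.

    fraction=0.2 mirrors the ~20% figure quoted above; the selection
    rule itself is an assumption made for this sketch.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: five positions over a 3-word vocabulary (made up).
toy = [
    [0.98, 0.01, 0.01],    # near-deterministic continuation
    [0.34, 0.33, 0.33],    # almost uniform: a candidate forking point
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.70, 0.20, 0.10],
]
print(forking_positions(toy))
```

On this toy input only the near-uniform position is selected. A sentence embedding that pools over all five positions would give that position no special weight, which is the smearing risk described above.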
Yet the same fine-grained evidence cuts the other way too, and this is the surprise. Much of what lives at the token level is *not* computation at all. Chain of Draft matches full chain-of-thought accuracy using only 7.6% of the tokens, because the other 92.4% served style and documentation rather than reasoning (Can minimal reasoning chains match full explanations?). And models trained with hidden chain-of-thought tokens appear to compute their answers in the first few layers, then actively suppress that reasoning in the final layers in favor of format-compliant filler (Do transformers hide reasoning before producing filler tokens?). So the token stream is mostly padding wrapped around a few critical decisions, which is exactly the condition under which a coarser, sentence-level abstraction might *help*, by throwing away the verbosity and keeping the concepts.
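The padding claim is a ratio: how many tokens a terse draft needs relative to the full chain. The sketch below shows that arithmetic on a made-up example. Whitespace splitting stands in for a real tokenizer, and both strings are illustrative rather than taken from the cited experiments, so the resulting number says nothing about the 7.6% figure itself.

```python
def kept_fraction(full_chain: str, draft: str) -> float:
    """Fraction of the full chain's token count that the draft uses.

    Whitespace splitting is a crude stand-in for a model tokenizer.
    """
    full_tokens = full_chain.split()
    draft_tokens = draft.split()
    if not full_tokens:
        raise ValueError("full_chain must contain at least one token")
    return len(draft_tokens) / len(full_tokens)


# Made-up verbose chain and terse draft for the same arithmetic step.
full = ("Jason started with 20 lollipops. He gave some to Denny. "
        "Now he has 12 left. So the number he gave away is "
        "20 minus 12, which is 8.")
draft = "20 - 12 = 8"

ratio = kept_fraction(full, draft)
print(f"draft keeps {ratio:.1%} of the tokens, drops {1 - ratio:.1%}")
```

The draft keeps every symbol the computation needs and drops only the narration, which is the sense in which most of the chain is documentation.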
The deeper reframing is that granularity may not be the binding constraint on complex reasoning at all. When reasoning models 'collapse,' the failure is often execution, not representation: text-only models can't carry out long multi-step procedures even when they know the algorithm, and giving them tools pushes the supposed reasoning cliff far back (Are reasoning model collapses really failures of reasoning?). Failures also track instance-novelty rather than complexity: models succeed on any chain resembling their training instances and break on unfamiliar ones, suggesting they fit patterns rather than run general algorithms (Do language models fail at reasoning due to complexity or novelty?). Chain-of-thought itself degrades predictably the moment you leave the training distribution, producing fluent but logically invalid steps (Does chain-of-thought reasoning actually generalize beyond training data?). None of these failure modes would be fixed by changing the unit of reasoning from token to sentence.
There's also a structural-complexity story that's orthogonal to token-vs-sentence: LLMs make systematic linguistic errors that worsen as syntactic depth increases, misreading embedded clauses and complex nominals, because statistical learning captures surface patterns, not deep grammar (Why do large language models fail at complex linguistic tasks?). And reasoning degrades sharply with input length far below the context window, in a way uncorrelated with language-modeling skill (Does reasoning ability actually degrade with longer inputs?). The structure that complex reasoning needs, in other words, is partly about *which* structure you route to: StructRAG shows that matching the knowledge representation (table, graph, algorithm, catalogue) to the task's demands beats uniform retrieval (Can routing queries to task-matched structures improve RAG reasoning?). That's the real lesson the corpus offers your question: 'enough structure' isn't a fixed property of sentence-level granularity; it's a fit problem between the representation and the task.
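The 'fit problem' can be pictured as a routing step that picks a knowledge structure per query. The sketch below uses a keyword heuristic purely as a stand-in: StructRAG's router is described in the sources as DPO-trained, and nothing here reproduces it. The cue words, the `route` function, and the scoring rule are all hypothetical.

```python
from enum import Enum


class Structure(Enum):
    TABLE = "table"
    GRAPH = "graph"
    ALGORITHM = "algorithm"
    CATALOGUE = "catalogue"
    CHUNK = "chunk"


# Hypothetical cue words; a trained router would learn this mapping.
CUES = {
    Structure.TABLE: ("compare", "versus", "statistics"),
    Structure.GRAPH: ("related", "connected", "chain"),
    Structure.ALGORITHM: ("steps", "procedure", "plan"),
    Structure.CATALOGUE: ("summarize", "overview", "list"),
}


def route(query: str) -> Structure:
    """Pick the structure whose cue words appear most often in the query.

    Falls back to plain chunks when no cue matches.
    """
    words = [w.strip("?.,!") for w in query.lower().split()]
    scores = {s: sum(w in cues for w in words) for s, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else Structure.CHUNK


print(route("Compare revenue of A versus B"))
print(route("What is the capital of France?"))
```

The point of the sketch is the shape of the decision, not the heuristic: the representation is chosen per task, so no single granularity, sentence-level or otherwise, is 'enough' in general.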
So the honest answer is: sentence-level granularity captures enough structure for *coherence and planning*, and may even help by stripping the documentary padding that dominates token streams — but it does nothing for the failures that actually limit complex reasoning, which are execution bandwidth, distribution shift, instance-novelty, and representation-task mismatch. The thing you didn't know you wanted to know: most of a reasoning chain is decoration, the real work hides in a minority of tokens and the earliest layers, and arguing about token-vs-sentence granularity is arguing about the wrapper while the bottleneck sits elsewhere.
Sources (11 notes)
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach is grounded in cognitive load theory and cognitive fit theory from cognitive science.