How does random walk length control reasoning complexity in question generation?
This explores how, in synthetic data generation that walks across a knowledge graph, the number of hops in each walk sets how many reasoning steps a generated question demands — and whether walk length is really the right dial for 'difficulty.'
The corpus has one paper aimed squarely at the mechanism, plus several that complicate the easy assumption that 'longer walk = harder question.' The direct answer comes from "Can knowledge graphs generate training data for search agents?": each step in a walk traverses one relation between entities, so a walk of length N becomes an N-hop question that can only be answered by chaining N facts together. Length is the knob for *required* reasoning depth. A second knob, selectively blurring entity names so they can't be looked up directly, forces the model to actually search and infer rather than pattern-match. Together they let you dial verifiable, multi-hop difficulty up or down on demand, which is how DeepDive-32B was trained to reach 14.8% on BrowseComp and outperform larger models.
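As a rough illustration of the mechanism, here is a minimal sketch of turning an N-hop walk into an N-hop question. It is not the paper's implementation: the toy graph, the placeholder entity and relation names, the `BLURRED` descriptions, and the helpers `random_walk` and `to_question` are all assumptions made for this example.

```python
import random

# Toy graph with placeholder names: entity -> list of (relation, neighbor).
TOY_KG = {
    "E0": [("r1", "E1")],
    "E1": [("r2", "E2")],
    "E2": [("r3", "E3")],
    "E3": [("r4", "E4")],
    "E4": [],
}

# Hypothetical vague descriptions that stand in for exact entity names.
BLURRED = {"E0": "an entity described only indirectly"}


def random_walk(kg, start, n_hops, rng):
    """Walk n_hops edges from start without revisiting an entity.

    Returns a list of (head, relation, tail) triples, or None if the
    walk hits a dead end before reaching n_hops.
    """
    path, current, visited = [], start, {start}
    for _ in range(n_hops):
        options = [(r, t) for r, t in kg.get(current, []) if t not in visited]
        if not options:
            return None
        relation, nxt = rng.choice(options)
        path.append((current, relation, nxt))
        visited.add(nxt)
        current = nxt
    return path


def to_question(path, blurred=None):
    """Turn a walk into (question, answer); intermediate entities stay hidden."""
    blurred = blurred or {}
    start = path[0][0]
    subject = blurred.get(start, start)  # blur the start entity if a description exists
    chain = ", then ".join(rel for _, rel, _ in path)
    question = f"Start from {subject}. Follow {chain}. What do you reach?"
    return question, path[-1][2]


path = random_walk(TOY_KG, "E0", 3, random.Random(0))
question, answer = to_question(path, BLURRED)
print(question)
print(answer)
```

Changing `n_hops` changes how many facts must be chained to reach the answer, and the answer stays verifiable because it is simply the last entity on the walk.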
But here's the thing the walk-length framing hides: hop count is a proxy for complexity, not complexity itself. "Do language models fail at reasoning due to complexity or novelty?" found that models don't actually break at some number-of-steps threshold; they break at *unfamiliarity*. A long chain succeeds if the model has seen similar instances, and a short one fails if the instance is novel. So a length-7 walk over well-trodden entities may be easier than a length-3 walk into an obscure corner of the graph. Walk length controls *nominal* reasoning depth; entity blurring and graph region control the *effective* difficulty, and those factors may matter more.
There's also a ceiling worth knowing about. If you generate ever-longer questions on the theory that longer means better training signal, "Why does chain of thought accuracy eventually decline with length?" shows that accuracy follows an inverted U: past an optimal length, more reasoning steps *hurt*, and the optimum shrinks as the model gets more capable. Pair that with "Does reasoning ability actually degrade with longer inputs?", where accuracy fell from 92% to 68% with only about 3,000 tokens of padding, and you see that piling on hops can degrade performance through sheer length before it ever tests deeper reasoning. Longer walks risk measuring length-fragility, not reasoning.
The cross-cutting lesson is that walk length is the *generation-side* control, but a good question has more dimensions than depth. "Can models learn to ask genuinely useful clarifying questions?" decomposes question quality into separate attributes (clarity, relevance, specificity) and trains on each independently rather than on a single difficulty score. Read alongside the random-walk method, it suggests a richer recipe: walk length gives you verifiable multi-hop structure, entity blurring gives you search-hardness, and attribute-level shaping gives you questions that are hard *and* well-posed. That last part matters, because "Why do reasoning models overthink ill-posed questions?" shows reasoning models will burn enormous effort on questions with missing premises instead of rejecting them. A long walk that accidentally generates an unanswerable chain doesn't teach reasoning; it teaches overthinking.
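One way to guard against unanswerable or ambiguous chains is to re-resolve each generated chain against the graph and keep it only if it leads to exactly one entity. The sketch below is an assumed filter, not something described in the cited notes; the graph and the helpers `resolve_chain` and `is_well_posed` are hypothetical.

```python
# Toy graph with placeholder names: entity -> list of (relation, neighbor).
KG = {
    "A": [("r", "B"), ("r", "C")],  # relation "r" is ambiguous from A
    "B": [("s", "D")],
    "C": [("s", "E")],
    "D": [],
    "E": [],
}


def resolve_chain(kg, start, relations):
    """Follow the relation chain from start and return every reachable end entity."""
    frontier = {start}
    for rel in relations:
        frontier = {t for e in frontier for r, t in kg.get(e, []) if r == rel}
    return frontier


def is_well_posed(kg, start, relations):
    """A chain counts as well-posed here only if it resolves to exactly one entity."""
    return len(resolve_chain(kg, start, relations)) == 1


print(is_well_posed(KG, "B", ["s"]))       # single answer
print(is_well_posed(KG, "A", ["r", "s"]))  # two candidate answers
print(is_well_posed(KG, "D", ["s"]))       # no answer at all
```

Under this assumption, chains with zero or multiple end entities would be dropped before training, so longer walks do not quietly add ill-posed questions to the data.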
Sources (6 notes)
KG-based random walks with selective entity obscuring create verifiable, multi-hop questions that train deep search agents effectively. DeepDive-32B trained on this data achieves 14.8% on BrowseComp, outperforming larger models through end-to-end multi-turn RL.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.