INQUIRING LINE

What geometric structure do language models actually use during inference?

This explores what kind of internal geometry — coordinate systems, layer structure, token rankings — language models rely on when they're actually running, as opposed to the clean symbolic structures we imagine they use.


The corpus's answer is surprising: the geometry language models deploy at inference is real and structured, but it is statistical in origin and used in ways that look nothing like clean symbolic reasoning. Start with the most concrete finding. Inside the activations, models lay out syntax in something like a polar coordinate system: the *type* and *direction* of a syntactic relation are carried by both the angle and the distance between embeddings, which nearly doubles probing accuracy over distance-only methods (see: How do language models encode syntactic relations geometrically?). So there genuinely is a learned, structured geometry: networks spontaneously build symbol-compatible shapes nobody designed for them.
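A minimal sketch of why angle plus distance can carry more than distance alone. It is a toy, not the Polar Probe itself: the 2-D positions, the relation names, and the helper functions `distance` and `angle_deg` are assumptions made up for illustration.

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle_deg(u, v):
    """Angle of the displacement from u to v, in degrees (2-D only)."""
    dx, dy = v[0] - u[0], v[1] - u[1]
    return math.degrees(math.atan2(dy, dx)) % 360

# Hypothetical probe-space positions for a head word and two dependents.
head = (0.0, 0.0)
subject = (1.0, 0.0)  # displacement points along +x
obj = (0.0, 1.0)      # displacement points along +y

# Distance alone cannot separate the two relations...
same_distance = math.isclose(distance(head, subject), distance(head, obj))

# ...but the angle of the displacement can.
subject_angle = angle_deg(head, subject)
object_angle = angle_deg(head, obj)

# Swapping the pair rotates the angle by 180 degrees, so direction is readable too.
reversed_angle = angle_deg(subject, head)
```

In this toy, a distance-only readout sees two identical pairs, while the angular readout tells them apart and also distinguishes head-to-dependent from dependent-to-head.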

But where does that shape come from? Not from any dedicated 'hierarchy module.' The nested, tree-like concept geometry you can measure in embeddings falls straight out of the spectral structure of word co-occurrence statistics: spectral analysis of which words appear near which other words is enough to predict and reproduce it (see: Where does hierarchical structure in language models come from?). The geometry is downstream of corpus statistics, which reframes the whole question: the model isn't using a structure it was taught, it's inheriting the shape of language itself.
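To make the spectral claim concrete, here is a small sketch under loud assumptions: the four words and their co-occurrence counts are invented, and plain power iteration stands in for whatever decomposition the underlying work uses. It only shows that the second eigenvector of a co-occurrence matrix can recover a category split that nobody encoded explicitly.

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def top_eigenvector(m, iters=200, orthogonal_to=()):
    """Power iteration; projecting out earlier eigenvectors yields the next one."""
    v = normalize([1.0 + 0.1 * i for i in range(len(m))])
    for _ in range(iters):
        for u in orthogonal_to:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        v = normalize(matvec(m, v))
    return v

# Invented symmetric co-occurrence counts: animals co-occur with animals,
# vehicles with vehicles, and only weakly across the two groups.
words = ["cat", "dog", "car", "bus"]
cooc = [
    [5, 4, 1, 1],
    [4, 5, 1, 1],
    [1, 1, 5, 4],
    [1, 1, 4, 5],
]

first = top_eigenvector(cooc)                           # overall frequency direction
second = top_eigenvector(cooc, orthogonal_to=[first])   # first split of the hierarchy

# Group words by the sign of their coordinate on the second eigenvector.
groups = {}
for word, coord in zip(words, second):
    groups.setdefault(coord > 0, []).append(word)
clusters = sorted(groups.values())
```

The sign pattern of `second` separates the animal words from the vehicle words using nothing but the counts, which is the sense in which the geometry is "downstream of corpus statistics."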

The catch is that this statistically derived geometry is used semantically, not symbolically. Decouple meaning from logical form and reasoning performance collapses even when the correct rules sit right there in the context: models navigate by token associations and parametric commonsense, not formal manipulation (see: Do large language models reason symbolically or semantically?). The same statistical-over-structural bias shows up as systematic blind spots that worsen predictably with syntactic depth (see: Why do large language models fail at complex linguistic tasks?), and as failures that track *instance novelty* rather than task complexity, because models fit patterns of specific examples rather than running a general algorithm (see: Do language models fail at reasoning due to complexity or novelty?). The geometry encodes what's familiar, and bends where the training distribution thins out.

The most counterintuitive layer of the answer is *where in the network* the real computation lives. Logit-lens analysis shows that models trained with hidden chain-of-thought tokens compute correct answers in their earliest layers (layers 1-3), then actively suppress those representations in the final layers to emit format-compliant filler. The reasoning is still there, recoverable from lower-ranked token predictions (see: Do transformers hide reasoning before producing filler tokens?). Tokens also differ in functional importance: under greedy likelihood-preserving pruning, symbolic-computation tokens are preferentially preserved while grammar and meta-discourse are pruned first (see: Which tokens in reasoning chains actually matter most?). Depth itself turns out to matter more than width for small models, because composing abstract concepts happens *across layers* (see: Does depth matter more than width for tiny language models?). The geometry is a vertical pipeline, not a flat lookup.
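The logit-lens readout mentioned above can be sketched as follows. This is an illustrative toy, not a reproduction of the cited experiment: the three-token vocabulary, the 2-D hidden states, and the unembedding matrix are all invented, and a real logit lens would also apply the model's final normalization before unembedding.

```python
def logit_lens(hidden, unembed, vocab):
    """Project one hidden state through the unembedding and rank the vocabulary."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]

# Invented vocabulary: an answer token, a filler token, and a distractor.
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],    # direction that reads out "7"
    [0.0, 1.0],    # direction that reads out "..."
    [-1.0, -1.0],  # direction that reads out "the"
]

# Invented hidden states at the same position, early and late in the network.
hidden_by_layer = {
    "layer 2": [2.0, 0.5],      # answer direction dominates
    "final layer": [1.0, 3.0],  # filler direction has taken over
}

rankings = {
    name: logit_lens(hidden, unembed, vocab)
    for name, hidden in hidden_by_layer.items()
}
```

Here the early layer ranks the answer first, while the final layer ranks the filler first and leaves the answer in second place, which is the "suppressed but recoverable" pattern in miniature.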

The thing you may not have known you wanted to know: even the model's identity is a superposition rather than a commitment. Shanahan's 20-questions test shows a model holds a *distribution* over consistent characters and samples one at generation time: regenerate and you get a different answer, each consistent with the prior context, none fixed (see: Do large language models actually commit to a single character?). So the structure language models use at inference is best read as a probability landscape carved by corpus statistics, traversed semantically, computed across depth, and collapsed into a single sample only at the moment of output. It is not a symbolic machine but a geometry of likelihood.
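A toy version of the 20-questions setup, offered as an analogy rather than a model of how an LLM works internally: the candidate objects, their attributes, and the `reveal` helper are hypothetical. The point is only that "what were you thinking of?" can be answered by sampling from whatever remains consistent with the conversation so far.

```python
import random

# Hypothetical objects and the yes/no facts a questioner might ask about.
candidates = {
    "cat": {"animal": True, "larger_than_breadbox": False},
    "horse": {"animal": True, "larger_than_breadbox": True},
    "elephant": {"animal": True, "larger_than_breadbox": True},
    "teapot": {"animal": False, "larger_than_breadbox": False},
}

def consistent(history):
    """Every object that agrees with all answers given so far."""
    return sorted(
        name
        for name, facts in candidates.items()
        if all(facts[question] == answer for question, answer in history)
    )

def reveal(history, seed):
    """Sample one consistent object; a different seed models a regeneration."""
    return random.Random(seed).choice(consistent(history))

history = [("animal", True), ("larger_than_breadbox", True)]
remaining = consistent(history)

# Regenerate the reveal many times: each result agrees with the history,
# but nothing forces the same object to come back every time.
reveals = {reveal(history, seed) for seed in range(20)}
```

No object is stored anywhere before `reveal` is called; the only state is the history and the set it leaves open.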


Sources (9 notes)

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds when similar instances were seen in training, regardless of its length.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
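A rough sketch of what a greedy likelihood-preserving pruning loop could look like. Everything specific here is an assumption: `toy_loglik` is a stand-in scorer with invented per-token weights, whereas the real procedure would score each candidate chain with a language model, and the example chain is made up.

```python
def prune(tokens, loglik, keep):
    """Greedily delete the token whose removal hurts the likelihood score least."""
    kept = list(tokens)
    while len(kept) > keep:
        # Score every single-token deletion and apply the least harmful one.
        best = max(
            range(len(kept)),
            key=lambda i: loglik(kept[:i] + kept[i + 1:]),
        )
        del kept[best]
    return kept

# Invented importance weights: computation tokens matter, glue words barely do.
importance = {
    "so": 0.1, "we": 0.2, "get": 0.3,
    "3": 2.0, "*": 2.5, "4": 2.0, "=": 1.5, "12": 3.0,
}

def toy_loglik(kept_tokens):
    """Hypothetical scorer: answer likelihood grows with retained importance."""
    return sum(importance[token] for token in kept_tokens)

chain = ["so", "we", "get", "3", "*", "4", "=", "12"]
pruned = prune(chain, toy_loglik, keep=5)
```

With these weights the glue words are dropped first and the arithmetic survives, mirroring the reported pattern that grammar and meta-discourse go before symbolic computation.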

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Next inquiring lines