Why do LLMs struggle to translate natural language into logical formalizations?
This explores why LLMs reliably produce well-formed logic that doesn't actually mean what the original sentence meant — the gap between syntactic fluency and semantic fidelity in autoformalization.
This explores why LLMs reliably produce well-formed logic that doesn't actually mean what the original sentence meant. The corpus points to a consistent picture: the failure isn't grammatical, it's semantic. Models generate syntactically valid logical expressions whose meaning has quietly drifted, with errors clustering around scope ambiguity, quantifier precision, and predicate granularity Can large language models translate natural language to logic faithfully?. Strikingly, the same work finds models read formal language better than they write it — so the bottleneck is generation, the act of committing natural language to a precise symbolic form, not comprehension.
Why would that be? A deeper note argues LLMs are semantic reasoners, not symbolic ones: they lean on token associations and parametric commonsense rather than manipulating formal structure, and when you strip the familiar semantic content away from a reasoning task, performance collapses even with the correct rules sitting in context Do large language models reason symbolically or semantically?. Formalization demands exactly the thing they're weakest at — treating symbols as symbols, independent of what they 'sound like' they should mean. This connects to a broader family of failures where models can state a principle correctly but cannot execute it: 'Potemkin understanding,' where correct explanation coexists with failed application Can LLMs understand concepts they cannot apply?, and the same explanation-versus-action gap documented elsewhere as a kind of computational split-brain Can language models understand without actually executing correctly?. Translating to logic lives squarely in the 'execution' lane that these split pathways keep failing.
Two other corners of the corpus add texture. First, the difficulty scales with structure: LLMs handle simple sentences fine but degrade predictably as syntactic depth, embedding, and recursion increase, suggesting they learned surface heuristics rather than real grammatical structure Does LLM grammatical performance decline with structural complexity?, Why do large language models fail at complex linguistic tasks?. Faithful formalization requires parsing exactly those nested, quantified, embedded constructions. Second, much of natural-language meaning lives in what's *unstated* — the background preconditions a formal representation must make explicit. The 'modern frame problem' note shows models fail not from missing knowledge but from not bringing relevant unstated conditions forward, with accuracy jumping from 30% to 85% when you force explicit enumeration Do language models fail at identifying unstated preconditions?.
The most interesting twist is what the corpus says actually works — and it isn't 'try harder at full formalization.' Going all the way to pure symbolic logic *throws away* semantic information; pure language lacks structure; the winners enrich natural language with selective symbolic scaffolding, gaining 4–8% by keeping both Why does partial formalization outperform full symbolic logic?. And where full formalization is needed, the fix is architectural division of labor: let the LLM draft the symbolic representation but hand inference to a deterministic solver that returns machine-verifiable error messages — a feedback loop that catches translation errors far better than the model critiquing itself Can symbolic solvers fix how LLMs reason about logic?. That last point reframes the whole question. The reason LLMs struggle to translate language into logic may be the same reason they can't self-correct their way out of it: reliable formalization needs an external verifier, because the model has no internal symbolic ground truth to check itself against What stops large language models from improving themselves?.
Sources 10 notes
LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.