Why do LLMs fail at faithful autoformalisation of reasoning problems?

This explores why LLMs stumble when translating a stated problem into a faithful formal representation (logic, symbols, precise structure): what breaks in the move from natural language to formal notation, and what the corpus suggests fixes it.

This reads the question as being about faithful autoformalization: turning a messy natural-language problem into a clean symbolic representation that preserves what the problem meant. No paper in the corpus has "autoformalisation" in its title, but several of them triangulate the failure from different angles, and they converge on one diagnosis: LLMs translate by semantic association, not by symbolic commitment.

The sharpest evidence is that models reason through meaning, not form. When researchers strip the familiar semantic content out of a task and leave only the logical structure, performance collapses even when the correct rules are sitting right there in the prompt (Do large language models reason symbolically or semantically?). Formalization is exactly the operation that demands you ignore surface meaning and honor structure, so a system leaning on token associations and parametric commonsense is working against its own grain. A related study finds that humans and LLMs both wobble along the same content-sensitivity axis, which suggests content-independence may not be the right yardstick for reasoning in general. For faithful formalization specifically, though, content-dependence is a bug, not a feature (Do language models fail reasoning tests that humans pass?).
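To make "stripping the semantic content" concrete, here is a minimal sketch of one way to decouple a problem's wording from its logical structure. It is an illustration only, not the cited study's procedure: the `decouple` function, the `P1`, `P2`, ... placeholder scheme, and the example sentences are all assumptions introduced here.

```python
import re

def decouple(text: str, content_words: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed content word with an abstract symbol (P1, P2, ...).

    Hypothetical helper: the logical connectives ("if", "then") are left
    untouched, so only the structure of the problem survives.
    """
    mapping = {w.lower(): f"P{i}" for i, w in enumerate(content_words, start=1)}
    if not mapping:
        return text, mapping
    # Match whole words only, case-insensitively.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b",
        re.IGNORECASE,
    )
    abstract = pattern.sub(lambda m: mapping[m.group(0).lower()], text)
    return abstract, mapping

problem = "If it is a sparrow then it is a bird. If it is a bird then it has wings."
abstract, mapping = decouple(problem, ["sparrow", "bird", "wings"])
print(abstract)
# If it is a P1 then it is a P2. If it is a P2 then it has P3.
```

Both versions encode the same two implications. A system that manipulates form should handle them equally well; a system that leans on what it knows about sparrows and birds has nothing to lean on in the second.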

Two other failure modes explain why the translation drops information. The "frame problem" work shows that models routinely fail to bring unstated preconditions forward as explicit constraints, and that prompting which forces enumeration raises accuracy from 30% to 85% (Do language models fail at identifying unstated preconditions?). A faithful formalization must surface precisely those hidden assumptions; if the model never enumerates them, the formal version is quietly incomplete. Worse, models accommodate false presuppositions baked into a problem even when they demonstrably know better, swallowing a bad premise instead of flagging it (Why do language models accept false assumptions they know are wrong?). Both mean the model formalizes what was said rather than what was meant.
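As a rough illustration of what "forcing enumeration" could look like in practice, the sketch below builds a prompt that asks for unstated preconditions before any formalization, then parses a numbered reply into a list of constraints. The prompt wording, the function names, and the sample reply are hypothetical; the cited note does not specify the prompt it used.

```python
import re

def build_enumeration_prompt(problem: str) -> str:
    """Hypothetical prompt: ask for hidden preconditions before formalizing."""
    return (
        "Before formalizing, list every precondition the problem relies on "
        "but does not state, one per numbered line.\n\n"
        f"Problem: {problem}\n\nUnstated preconditions:\n1."
    )

def parse_preconditions(response: str) -> list[str]:
    """Extract items from lines that start with '1.' or '1)' style numbering."""
    items = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            items.append(match.group(1).strip())
    return items

# A made-up model reply, used only to exercise the parser.
sample_reply = "1. The door is unlocked\n2) The key exists\nnot an item"
preconditions = parse_preconditions(sample_reply)
print(preconditions)
# ['The door is unlocked', 'The key exists']
```

Each extracted item can then be added to the formal representation as an explicit constraint, so an omission becomes visible instead of staying silent.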

Then there's the gap between explaining and doing. "Potemkin understanding" describes models that give a correct account of a concept yet fail to apply it, with explanation and execution running on disconnected pathways (Can LLMs understand concepts they cannot apply?). The same split shows up as a knowing-doing gap, where models produce the right rationale 87% of the time but follow it only 64% of the time (Why do language models fail to act on their own reasoning?). A model can articulate the formal rules of a domain and still produce a formalization that violates them. Linguistic-structure work adds that errors compound predictably as syntactic depth grows, so the longer and more nested the problem, the more the translation degrades (Why do large language models fail at complex linguistic tasks?).

The most hopeful thread is that you may not need the LLM to formalize perfectly on its own. Logic-LM divides the labor: the LLM drafts a symbolic representation, a deterministic solver executes the inference, and, crucially, the solver hands back machine-verifiable error messages that catch translation mistakes far better than the model's own self-critique (Can symbolic solvers fix how LLMs reason about logic?). That matters because models hit a formal ceiling on self-improvement: reliable correction requires an external verifier, not more introspection (What stops large language models from improving themselves?). So the corpus's quiet punchline is that faithful autoformalization fails for an architectural reason: semantic pattern-matching can't self-enforce symbolic fidelity. The route through it isn't a smarter prompt but an external checker in the loop.
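The draft, solve, and repair loop described above can be sketched in a few lines. This is a toy under stated assumptions, not Logic-LM's actual interface: the `fact:` / `rule:` mini-language, the `solve` forward-chainer, and `stub_draft` (a stand-in for an LLM call that fixes its draft once it sees the solver's error) are all invented for illustration.

```python
from typing import Callable, Optional

class FormalizationError(Exception):
    """Raised by the toy solver when a draft cannot be parsed."""

def solve(program: str, query: str) -> bool:
    """Toy forward-chaining solver over 'fact: a' and 'rule: a, b -> c' lines."""
    facts: set[str] = set()
    rules: list[tuple[frozenset[str], str]] = []
    for lineno, raw in enumerate(program.strip().splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fact:"):
            facts.add(line[len("fact:"):].strip())
        elif line.startswith("rule:"):
            body = line[len("rule:"):]
            if "->" not in body:
                raise FormalizationError(f"line {lineno}: rule is missing '->': {line!r}")
            lhs, rhs = body.split("->", 1)
            premises = frozenset(p.strip() for p in lhs.split(",") if p.strip())
            rules.append((premises, rhs.strip()))
        else:
            raise FormalizationError(f"line {lineno}: expected 'fact:' or 'rule:': {line!r}")
    # Apply rules until no new fact can be derived.
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return query in facts

def formalize_and_solve(
    problem: str,
    query: str,
    draft: Callable[[str, Optional[str]], str],
    max_rounds: int = 3,
) -> Optional[bool]:
    """Ask `draft` for a program; on a solver error, retry with that error as feedback."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        program = draft(problem, feedback)
        try:
            return solve(program, query)
        except FormalizationError as err:
            feedback = str(err)  # external, machine-generated signal
    return None  # no parseable draft within the round budget

def stub_draft(problem: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: the first draft is malformed, the second is repaired."""
    if feedback is None:
        return "fact: rain\nrule: rain => wet"  # '=>' is not valid in the toy language
    return "fact: rain\nrule: rain -> wet"

result = formalize_and_solve("It is raining. Rain makes things wet.", "wet", stub_draft)
print(result)
# True
```

The point of the design is that the repair signal comes from the solver's parser rather than from the drafter's own judgment. Note the limit of this toy, though: a syntax check only catches drafts that fail to parse, so a well-formed but unfaithful translation would still pass through.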

Sources (9 notes)

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can symbolic solvers fix how LLMs reason about logic?

Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
