How do LLMs translate informal prose into logically correct formal specifications?

This explores autoformalisation — whether LLMs can take everyday prose and convert it into formal logic or specifications that are not just well-formed but actually mean what the prose meant.

This explores autoformalisation: turning informal prose into formal logic that is logically *correct*, not merely syntactically valid. The corpus delivers a sharp split here. LLMs are good at producing logic that *looks* right and bad at producing logic that *is* right — Can large language models translate natural language to logic faithfully? finds models generate well-formed expressions that are semantically wrong, with errors clustering predictably at scope ambiguity, quantifier precision, and predicate granularity. Intriguingly, the failure is asymmetric: models seem to understand formal language better than they can generate it. So the bottleneck isn't reading logic — it's committing prose to a single faithful logical form.

Why does the translation break down? Two adjacent notes point at root causes. First, prose is often ambiguous on purpose, and LLMs can't see it: Can language models recognize when text is deliberately ambiguous? shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, and cannot hold multiple interpretations at once. Formalisation *forces* a choice of interpretation, so a model blind to ambiguity will confidently formalise the wrong reading. Second, the underlying machinery isn't symbolic at all — Do large language models reason symbolically or semantically? finds that when meaning is decoupled from the task, performance collapses even with correct rules in hand. The model is pattern-matching on semantics, not manipulating symbols, which is exactly the skill faithful formalisation demands.

The most useful counter-move in the corpus is to *not fully formalise*. Why does partial formalization outperform full symbolic logic? reports that selectively enriching natural language with symbolic elements (QuaSAR, Logic-of-Thought) beats both pure prose and full formalisation by 4–8%. Full translation throws away semantic information the model still needs; partial abstraction keeps the prose's richness while adding just enough structure. The lesson runs against intuition: the path to *more* logical correctness can be *less* complete symbolisation.

Two more notes suggest scaffolding that helps. Forcing explicit reasoning steps surfaces hidden premises — Can structured argument prompts make LLM reasoning more rigorous? shows structured critical-question prompting makes models check warrants and backing they'd otherwise skip, the same implicit-premise gaps that wreck a formalisation. And the capability isn't absent: Can language models actually analyze language structure? finds reasoning models can build syntactic trees and phonological generalisations through step-by-step analysis. So the raw analytic ability exists — it just has to be deliberately invoked rather than assumed.

The thing you may not have expected to learn: the failure mode isn't that LLMs can't speak logic. They can. It's that informal prose is underdetermined, and a system that reasons by smooth semantic association rather than symbolic commitment will paper over the ambiguity instead of resolving it — producing fluent, valid, and quietly wrong specifications. That's why the best results come from keeping the human-readable text in the loop, not replacing it.

Sources 6 notes

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

How do LLMs translate informal prose into logically correct formal specifications?

Sources 6 notes

Next inquiring lines