INQUIRING LINE

What makes natural-language APIs particularly suited to LLM-based simulation?

This explores why interfaces defined in natural language — a search engine that returns text, a user who converses, a human who explains a decision — are the sweet spot for LLM simulation, while structured or numeric interfaces are not.


This reads the question as asking where LLM-based simulation actually works, and why the answer keeps coming back to natural language. The pattern across the corpus is striking: LLMs make convincing simulators precisely when the thing being simulated speaks in text. A search engine, viewed as an API, takes a query string and returns documents, and LLMs can fabricate those documents from internal knowledge well enough that a 14B simulator matches or beats a real engine for training purposes, with no API calls required (Can LLMs replace search engines during agent training?). A conversational user is likewise a natural-language interface, and conditioning a simulator on a session-level profile and turn-level intent produces synthetic dialogue that crowdworkers and discriminators can't reliably tell from real conversations (Can controlled latent variables make LLM user simulators realistic?). Even human decision-making, when expressed as choices and rationales, can be modeled by a finetuned LLM more accurately than purpose-built cognitive theories (Can language models learn to model human decision making?).
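
To make the "search engine as an API" reading concrete, here is a minimal Python sketch of a text-in, text-out search interface with an LLM-backed stand-in. Every name in it (`SearchAPI`, `SimulatedSearch`, `generate`, `fake_llm`, the prompt wording) is hypothetical and chosen for illustration; it is not the ZeroSearch or SSRL implementation, only a picture of why the caller cannot tell which side of the interface produced the text.

```python
from typing import Callable, List, Protocol


class SearchAPI(Protocol):
    """Anything that maps a query string to a list of text documents."""

    def search(self, query: str, k: int = 3) -> List[str]: ...


class SimulatedSearch:
    """Hypothetical simulator: an LLM writes the documents instead of an index."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        # `generate` is an assumed prompt -> completion callable (any LLM client).
        self.generate = generate

    def search(self, query: str, k: int = 3) -> List[str]:
        prompt = (
            f"Write {k} short documents a search engine might return for: {query}\n"
            "Separate documents with a line containing only '---'."
        )
        raw = self.generate(prompt)
        # Split the completion into documents and drop empty pieces.
        docs = [d.strip() for d in raw.split("\n---\n") if d.strip()]
        return docs[:k]


def run_agent_step(engine: SearchAPI, query: str) -> str:
    """The caller only sees text, so a real or simulated engine is interchangeable."""
    return "\n\n".join(engine.search(query))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "Doc about topic A.\n---\nDoc about topic B.\n---\nDoc about topic C."


context = run_agent_step(SimulatedSearch(fake_llm), "example query")
```

Because `run_agent_step` depends only on the `SearchAPI` shape, swapping the simulator for a real engine would not change the calling code, which is the property that makes the simulation usable during training.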

The deeper reason shows up when you look at where simulation breaks. The same models that ghost-write search results plateau at 55–60% on genuine constraint satisfaction regardless of size (Do larger language models solve constrained optimization better?), can't actually run iterative numerical procedures, instead pattern-matching a memorized template and emitting plausible-but-wrong numbers (Do large language models actually perform iterative optimization?), and fail on relational queries that need real joins across structured tables even when the whole table fits in context (Can long-context LLMs replace retrieval-augmented generation systems?). So the dividing line isn't difficulty; it's the interface. Natural-language APIs are forgiving in exactly the way LLMs need: the output is judged by plausibility, and for free-form text such as a search snippet or a user turn, plausibility and correctness nearly coincide. Structured APIs demand executed computation, where a plausible-looking answer is just wrong.
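
The gap between "plausible" and "correct" on a structured interface can be illustrated with a small executed check. The sketch below is an assumption-laden toy (the `Item` type, the item data, the capacity of 8, and `satisfies_constraints` are all invented for illustration and do not come from the cited notes): a reasonable-sounding answer is rejected the moment the constraint is actually computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Item:
    name: str
    weight: int
    value: int


def satisfies_constraints(chosen: List[str], items: Dict[str, Item], capacity: int) -> bool:
    """Executed check: names must exist, appear once, and fit within capacity."""
    if len(set(chosen)) != len(chosen):
        return False
    if any(name not in items for name in chosen):
        return False
    return sum(items[name].weight for name in chosen) <= capacity


# Hypothetical toy data.
items = {i.name: i for i in [Item("tent", 5, 9), Item("stove", 4, 7), Item("lamp", 3, 4)]}

# A fluent heuristic answer ("take the two most valuable items") weighs 5 + 4 = 9.
plausible_answer = ["tent", "stove"]
# An answer that actually respects the capacity weighs 5 + 3 = 8.
feasible_answer = ["tent", "lamp"]

plausible_ok = satisfies_constraints(plausible_answer, items, capacity=8)  # False
feasible_ok = satisfies_constraints(feasible_answer, items, capacity=8)    # True
```

A natural-language interface has no equivalent of `satisfies_constraints`: a reader judges a simulated search snippet by whether it reads as reasonable, so the failure shown here has no counterpart there.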

There's a more fundamental version of this point in the corpus. Treating an LLM as an autoregressive probability machine predicts that tasks succeed when the target response is high-probability and fail when it isn't (Can we predict where language models will fail?). Natural-language interfaces traffic in exactly the kind of high-probability, distributionally-typical text the model was trained on, which is why simulating 'what a user would plausibly say next' lands while simulating 'the exact optimum of this constraint set' doesn't.
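
One way to make the "autoregressive probability machine" framing concrete is the standard chain-rule factorization of a response's probability. The symbols here are notation introduced for illustration and are not taken from the cited note: $x$ is the prompt, $y = (y_1, \dots, y_T)$ the target response, and $\theta$ the model parameters.

```latex
\log p_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right)
```

Read this way, a response is easy when each next token is likely given what came before, as in ordinary conversational text, and hard when even a logically simple target forces several low-probability steps, as the cited note reports for reciting the alphabet backwards.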

The most interesting framing, though, is that natural-language APIs let the simulator and the simulated share a medium. From inside a discourse, humans and LLMs draw on the same symbolic substrate, language itself, which makes the gap between them structural rather than absolute (Do humans and LLMs differ fundamentally or just superficially?). A natural-language API is essentially a slot where that shared substrate is the whole interface, so the model isn't translating into a foreign representation; it's operating in its native one. One caveat keeps this honest: when you push LLMs toward simulating *actions* in the world rather than text, the surrounding harness (memory, tools, grounding) decides whether the action is real or hallucinated, not the model alone (Can you turn an LLM into an agent by just fine-tuning?). Natural language is where LLM simulation is cheap and convincing; the moment the API stops being language and starts being execution, the magic stops with it.


Sources (9 notes)

Can LLMs replace search engines during agent training?

ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations whose realism can be measured via crowdworker discrimination, discriminator models, and classifier-ensemble distribution matching.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Do humans and LLMs differ fundamentally or just superficially?

Applying Habermas's observer/participant distinction to AI shows that, from outside, humans and LLMs are utterly different, while from within shared discourse both draw on the same symbolic substrate, making the difference structural rather than absolute.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
