Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem

Paper · arXiv 2602.21814 · Published February 25, 2026
Prompts · Prompting · Evaluations · Flaws

The car wash problem asks a simple question: “I want to wash my car. The car wash is 100 meters away. Should I walk or drive?” Every major LLM tested (Claude, GPT-4, Gemini) recommended walking. The correct answer is to drive, because the car itself must be at the car wash.

We ran a variable isolation study to determine which prompt architectural layer resolves this failure. Six conditions were tested on Claude Sonnet 4.5, with 20 trials each. A bare prompt with no system instructions scored 0%. Adding a role definition alone also scored 0%. A STAR reasoning framework (Situation, Task, Action, Result) reached 85%. User profile injection with physical context (car model, location, parking status) reached only 30%. STAR combined with profile injection reached 95%. The full stack combining all layers scored 100%.

The central finding is that structured reasoning outperformed direct context injection by a factor of 2.83× (Fisher’s exact test, p = 0.001). STAR forces the model to articulate the task goal before generating a conclusion, which surfaces the implicit physical constraint that context injection leaves buried. Adding a sixth condition resolved a confound in the original five-condition design by isolating per-layer contributions: STAR accounts for +85 percentage points, profile injection adds another +10, and RAG provides the final +5 to reach perfect reliability.
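As a sanity check on the headline comparison, the sketch below recomputes the accuracy ratio and the significance test from the reported figures. It assumes the 85% and 30% accuracies correspond to 17/20 and 6/20 correct trials, and it uses SciPy’s fisher_exact rather than the paper’s own analysis code, which is not shown.

```python
# Sanity check of the STAR-vs-profile comparison from the reported figures.
# Assumption: 85% and 30% over 20 trials each means 17/20 and 6/20 correct;
# this is not the paper's analysis code.
from scipy.stats import fisher_exact

star_correct, star_trials = 17, 20          # STAR framework condition (85%)
profile_correct, profile_trials = 6, 20     # profile-injection condition (30%)

contingency = [
    [star_correct, star_trials - star_correct],           # correct / incorrect
    [profile_correct, profile_trials - profile_correct],
]

_, p_value = fisher_exact(contingency, alternative="two-sided")
ratio = (star_correct / star_trials) / (profile_correct / profile_trials)

print(f"accuracy ratio: {ratio:.2f}x")            # 2.83x
print(f"Fisher's exact p-value: {p_value:.4f}")   # roughly 0.001
```

With these counts, the two-sided p-value lands near 0.001, consistent with the reported value.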

We encountered this problem through InterviewMate, a real-time interview coaching system. During a routine test session, the system answered “drive” while every standalone LLM we tested said “walk.” We did not expect this. InterviewMate’s system prompt has multiple layers—role definition, a STAR reasoning framework, user profile data, and RAG context retrieval—and we had no way to tell which layer produced the correct answer. The result was interesting, but we could not explain it, and a result you cannot explain is not one you can build on. So we designed a variable isolation experiment. Instead of asking why LLMs fail at this problem—a question the Hacker News thread had already covered extensively—we asked which specific prompt layer fixes it within a single model.
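To make the six conditions concrete, here is a hypothetical sketch of how the prompt layers could be stacked per condition. The layer wording, the user-profile details, and the exact composition of each condition are illustrative assumptions on my part, not InterviewMate’s actual system prompt.

```python
# Hypothetical composition of the six experimental conditions from prompt layers.
# All layer text below is illustrative; the real InterviewMate prompt is not shown.

ROLE = "You are an interview coaching assistant. Answer practical questions directly."

STAR = (
    "Reason in four labelled steps before answering:\n"
    "Situation: restate the facts.\n"
    "Task: state what the user actually needs to accomplish.\n"
    "Action: list the options and the constraints on each.\n"
    "Result: recommend the option that satisfies the task."
)

PROFILE = (  # physical context: car model, location, parking status (values invented)
    "User context: owns a sedan, currently parked in the driveway at home; "
    "the nearest car wash is 100 meters away."
)

RAG = "Retrieved note: washing a car at a car wash requires the car to be brought there."

CONDITIONS = {
    "bare":         [],                          # no system instructions (0%)
    "role_only":    [ROLE],                      # role definition alone (0%)
    "star":         [ROLE, STAR],                # STAR framework (85%)
    "profile":      [ROLE, PROFILE],             # profile injection (30%)
    "star_profile": [ROLE, STAR, PROFILE],       # STAR + profile (95%)
    "full_stack":   [ROLE, STAR, PROFILE, RAG],  # all layers (100%)
}

def build_system_prompt(condition: str) -> str:
    """Join a condition's layers into one system prompt string."""
    return "\n\n".join(CONDITIONS[condition])
```

A scoring harness would then send the car wash question under each condition 20 times and mark a trial correct only if the answer recommends driving.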