Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Paper · arXiv 2506.15674 · Published June 18, 2025
Flaws

We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model’s internal thinking, not just its outputs.1

Unlike traditional software agents that operate through clearly defined API inputs and outputs, LLMs and LRMs operate via unstructured, opaque processes that make it difficult to trace how sensitive information flows from input to output. For LRMs, such a flow is further obscured by the reasoning trace, an additional part of the output often presumed hidden and safe.

To shed light on these privacy issues, we look into the reasoning traces and find that they contain a wealth of sensitive user data, repeated from the prompt. Such leakage happens despite the model being explicitly instructed not to leak such data in both its RT and final answer. Although RTs are not always made visible by model providers, our experiments reveal that (i) models are unsure of the boundary between reasoning and final answer, inadvertently leaking the highly sensitive RT into the answer, (ii) a simple prompt injection attack can easily extract the RT, and (iii) forcibly increasing the reasoning steps in the hope of improving the utility of the model amplifies leakage in the reasoning.

We find that leakage in the reasoning is mostly driven by a simple recollection mechanism: if a LRM is asked to provide the user’s age, it simply cannot help but materialize its actual value within its RT, exposing it to risk of extraction. Moreover, when this mechanism is suppressed by forcibly anonymizing the reasoning post-hoc, the utility of the agent declines.

To scale the amount of reasoning, we employ budget forcing (Muennighoff et al., 2025) which forces the model to reason for a fixed number of tokens B. If the model tries to conclude its reasoning before reaching the budget B, we replace the token with a randomly selected string that encourages continued reasoning ("Wait,", "But, wait,", "Oh, wait"). When the reasoning reaches B tokens, we append "Okay, I have finished thinking " for a smooth transition to the answer. To disable thinking (B = 0), we use the NoThinking technique (Ma et al., 2025), where we set the reasoning trace to "Okay, I have finished thinking ".

Is the abundant private data in the reasoning trace at risk of leaking in the final answer?

we examine the reasoning traces and find that leaking in the reasoning is cause for concern because: (i) models often ignore anonymization instructions, (ii) they struggle to distinguish between reasoning and final answers, leading to unintentional leaks, (iii) prompt injection can force reasoning leaks into the answer, creating a new attack surface, and (iv) efforts to anonymize reasoning significantly reduce model utility.

Reasoning models sometimes confuse reasoning and answer. Example 1 illustrates such a case: DeepSeek-R1 first reasons and answers, but then ruminates again over the answer, and inadvertently leaks personal data by reasoning outside the . . . window.

We aim to answer two key questions: (i) Why and how does the model use private data in its reasoning?, and (ii) What reasoning processes lead to a leakage in the answer?

The overwhelming majority of leaks (74.8%) were labeled as RECOLLECTION, indicating direct and unfiltered reproduction of a single private attribute (e.g., “<--think --> I have been asked to output the user’s age. The user’s age is 34. [...]”). An additional 16.5% of cases involved MULTIPLE RECOLLECTION, where multiple sensitive fields were used. These findings suggest that once the model accesses private data, it tends to use it freely and repeatedly within its internal computation, despite the privacy directives instructing the model to be discreet in both reasoning and answer. We view this phenomenon as akin to the Pink Elephant Paradox4: much like being told not to think of a pink elephant makes it difficult not to picture it, asking reasoning models about sensitive data will make them materialize it in their reasoning trace. Another notable category is ANCHORING (6.8%), where the model refers to the user by their own name. These behaviors further emphasize the model’s tendency, despite the anonymizing directives, to treat sensitive input as useful cognitive scaffolding.

In 9.4% of cases, we observe REPEAT REASONING, where internal thought sequences bleed into the answer