PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework for systematically understanding the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, mirroring episodic memory with historical user engagements and semantic memory with long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, built on episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow-thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset built from Change My View (CMV) on Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME's effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
For LLM personalization, three major paradigms have emerged: prompt engineering, retrieval-augmented generation, and training-based parameterization.
2.2 Memory Mechanisms for LLMs
Decades of psychological research have converged on the following human memory components: the sensory register, short-term memory, and long-term memory (Atkinson and Shiffrin, 1968a). Within durable long-term memory, a further distinction is made between episodic memory and semantic memory (Tulving et al., 1972; Tulving, 1985). Episodic memory refers to autobiographical events we can re-experience (Tulving, 2002; Clayton et al., 2007), e.g., recalling a specific conversation that happened last night. Semantic memory, on the other hand, refers to general facts and knowledge we have accumulated (Saumier and Chertkow, 2002; McRae and Jones, 2013), such as knowing that NLP stands for Natural Language Processing. In this work, we posit that this dual structure, episodic versus semantic memory, is especially pertinent to LLM personalization, as it mirrors the difference between remembering what happened in a particular interaction (episodes) and knowing what is true about a user's opinions, beliefs, and preferences (semantics). Integrating memory into LLM-based systems takes several forms.
A standard implementation of episodic memory is retrieval-based: past interactions (Park et al., 2023) and external facts (Yao et al., 2023) are indexed in a database and fetched on demand. In contrast, semantic memory is mostly realized parametrically: the model's parameters are updated by training on user data to embed user-level knowledge (Zhang et al., 2024b; Magister et al., 2024). Recent hybrid approaches attempt to combine the two by merely concatenating textual summaries with retrieved experiences (Tan et al., 2024; Zhong et al., 2024; Gupta et al., 2024), resulting in only superficial fusion. Recognizing this isolated usage and shallow integration, we formulate a more principled approach that allows deep information flow between episodic and semantic memories, which in turn enables the successful use of the newly proposed personalized thinking.
Episodic Memory Instantiation. The writing mechanism typically involves storing raw interaction data for efficiency and completeness. We thus focus on the reading mechanism, exploring several recall strategies ϕ(·): 1) recall the complete history (i.a., Shinn et al., 2023), 2) recall the most recent histories (i.a., Wang et al., 2024), and 3) recall relevant histories (i.a., Park et al., 2023). Since full-history recall is intractable for long-context conversations, we focus our experiments on recent and relevant recall. We also explore augmenting episodic memory with semantic memory–derived profile summaries (Richardson et al., 2023), referred to as textual-summary augmentation (TSA), which concatenates the textual summary with the recalled histories; despite the hybrid memory usage, we classify TSA under episodic memory per our adopted memory dichotomy.
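To make the reading mechanism concrete, below is a minimal sketch of the three recall strategies ϕ(·). The (text, embedding) store layout and the dot-product scoring are our own assumptions, not the paper's exact implementation.

```python
# A minimal sketch of episodic-memory reading; the history store format
# (list of dicts with "text" and "emb") is an illustrative assumption.
import numpy as np

def recall(history, query_emb, strategy="recent", k=5):
    if strategy == "full":      # 1) complete history (intractable when long)
        return [h["text"] for h in history]
    if strategy == "recent":    # 2) the k most recent interactions
        return [h["text"] for h in history[-k:]]
    if strategy == "relevant":  # 3) the k interactions most similar to the query
        sims = [float(np.dot(h["emb"], query_emb)) for h in history]
        top = np.argsort(sims)[-k:][::-1]
        return [history[i]["text"] for i in top]
    raise ValueError(f"unknown strategy: {strategy}")

def with_tsa(profile_summary, recalled):
    # TSA: prefix the semantic memory-derived summary to the recalled episodes.
    return [profile_summary] + recalled
```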
Semantic Memory Instantiation. We first explore different instantiations of the memory-writing function, specifically focusing on deriving Δ_{H(a)} by internalizing information from user history H(a), i.e., encoding abstract concepts (e.g., preferences) into semantic memory. Personalized semantic memory takes two forms: parametric (Δ^θ_{H(a)}) and textual. We provide a brief summary below, and Table A4 presents the input–output mappings for each instantiation.
Parametric form, Δ^θ_{H(a)}, encodes user preferences into the model's parameters. We examine several training objectives:
• Input-Only Training: Suitable when human-written personalized outputs are unavailable (Tan et al., 2024). Objectives include next-token prediction (NTP) and conditional input generation (CIG), e.g., generating a post from its title.
• Fine-Tuning (FT): The most common practice for personalizing model parameters (Zhang et al., 2024b; Magister et al., 2024; Tan et al., 2024). We consider two variants, output-oriented FT (O-FT) and task-oriented FT (T-FT), depending on whether end-task information is available.
• Preference Tuning: As an alternative to RLHF, employs methods such as DPO (Rafailov et al., 2023) and SimPO (Meng et al., 2024), an efficient variant that requires no reference model, to align model outputs with user preferences; a loss sketch follows the next paragraph.
Although RLHF has been used to learn user preferences (Li et al., 2024), its simpler alternative, preference tuning, remains largely unexplored for LLM personalization.
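As a concrete reference for the preference-tuning instantiation, here is a hedged sketch of the SimPO objective (Meng et al., 2024) in PyTorch; the variable naming is ours and the hyperparameter values are illustrative.

```python
# SimPO loss sketch: length-normalized implicit rewards with a target
# margin gamma, computed without a reference model. `chosen_logps` and
# `rejected_logps` are summed token log-probs of the preferred and
# dispreferred responses under the current policy; `*_lens` are token counts.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    # Average log-probability per token, scaled by beta, as the implicit reward.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens
    # Encourage the chosen reward to exceed the rejected one by at least gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```

In this setting, the preferred/dispreferred pair could, for example, contrast a reply the user found compelling with one they did not.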
Textual form represents user preferences as text, usually as a summary. We explore:
• Hierarchical Summarization (HSumm): hierarchically aggregates current interactions into concise summaries (Zhong et al., 2024), as sketched below.
• Parametric Knowledge Reification (PKR): a novel method that leverages a model, trained on a user's engagement history, to infer a concise profile summary. PKR offers a speed gain over HSumm.
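The HSumm bullet above can be made concrete with a short sketch; the `llm(prompt)` helper, the chunk size, and the prompts are hypothetical stand-ins, not the paper's exact configuration.

```python
# A minimal sketch of hierarchical summarization (HSumm). `llm` is a
# hypothetical text-in/text-out helper around any summarization-capable model.
def hsumm(interactions, chunk_size=10):
    # Level 1: summarize fixed-size chunks of the raw engagement history.
    chunks = [interactions[i:i + chunk_size]
              for i in range(0, len(interactions), chunk_size)]
    partials = [llm("Summarize this user's preferences from these posts:\n"
                    + "\n".join(c)) for c in chunks]
    # Level 2: aggregate the partial summaries into one concise profile.
    if len(partials) == 1:
        return partials[0]
    return llm("Merge these partial profiles into one concise summary:\n"
               + "\n".join(partials))
```

PKR, by contrast, would replace this whole loop with a single query to the user-tuned model, which is where its speed gain over HSumm comes from.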
During the memory-reading process, as shown in Equation (2), if the semantic memory is in parametric form, the model parameters are adjusted as θ + Δ^θ_{H(a)}; if in textual form, ⊕ is implemented by prefixing the generated profile summary to the input query q.
For instantiations that involve training, we utilize LoRA (Hu et al., 2022) for its efficiency and interpretability, allowing Δ^θ_{H(a)}, as an abstract state, to represent user-specific preferences and beliefs.
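A minimal sketch of this read operation, assuming per-user LoRA adapters saved under `adapters/<user_id>` and the Hugging Face `peft` library; the base model name and directory layout are illustrative, not the paper's setup.

```python
# Parametric read: apply the user's LoRA delta on top of the frozen base
# weights (theta + Delta^theta_{H(a)}). Textual read: implement the "+"
# composition by prefixing the profile summary to the query.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
base = AutoModelForCausalLM.from_pretrained(BASE_NAME)
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)

def read_semantic_memory(user_id: str, query: str, profile_summary=None):
    # A real system would cache or unload adapters between users.
    model = PeftModel.from_pretrained(base, f"adapters/{user_id}")
    prompt = f"{profile_summary}\n\n{query}" if profile_summary else query
    inputs = tokenizer(prompt, return_tensors="pt")
    return model.generate(**inputs, max_new_tokens=256)
```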
Our experiments reveal that episodic memory grounded in simple recency often outperforms a semantic-similarity retrieval strategy, in both accuracy and speed, because the most recent interactions tend to be the strongest predictors of immediate user behavior. In contrast, semantic memory allows us to infer user preferences and latent traits even without task-specific labels, as validated by the improved performance achieved through input-only training. The best performance is reached by task-oriented fine-tuning (T-FT), which directly learns the mapping from the input query to the final desired outcome. Surprisingly, preference-tuning methods underperform here, which deserves further investigation. Overall, using semantic memory (SM) alone generally leads to higher performance than using episodic memory (EM) alone. This suggests that semantic abstraction of user preferences and history may be more effective for personalization than simply recalling specific interactions.
We are thus motivated to apply the slow-thinking strategy to unlock personalized thinking. However, due to the fast-thinking training paradigm (i.e., direct mapping from input to output), we find that fine-tuned LLMs become specialist models overfitted to the target output space, losing the generalist capability of generating meaningful intermediate thoughts when prompted; a common failure mode is token repetition. We therefore unlock personalized thinking capabilities by training on synthesized personalized thoughts.
Capitalizing on the recent success of self-distillation (Zhang et al., 2019; Pham et al., 2022; Wang et al., 2023), we design the following algorithm to produce intermediate thoughts and feed them back to the model itself for learning the personalized thinking process.
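A minimal sketch of the resulting self-distillation loop, under our own assumptions: `generate_with_thought` and `matches_gold` are hypothetical helpers, and the paper's exact prompting and filtering criteria may differ.

```python
# Self-distillation for personalized thinking: elicit reasoning traces from
# the memory-augmented model, keep only traces whose final answer agrees
# with the gold personalized output, and fine-tune the model on them.
def synthesize_personalized_thoughts(model, train_examples):
    distilled = []
    for ex in train_examples:
        # Ask the model to reason about the user's preferences before answering.
        thought, answer = generate_with_thought(model, ex.query, ex.user_memory)
        # Filter: discard traces that lead to the wrong personalized output.
        if matches_gold(answer, ex.gold_output):
            distilled.append((ex.query, thought, ex.gold_output))
    return distilled  # fed back to the same model as fine-tuning data
```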
7.1 Main Results
Main results are shown in Table 3; full results across all five metrics appear in Table A3. Our key findings follow.
Generic Reasoning Has Limitations: Enabling generic chain-of-thought often underperforms the non-thinking baseline (see Table A3). The uncustomized reasoning trace merely scratches the surface, seeking broad answers rather than to-the-point, user-specific responses. A detailed case study appears in Appendix E.
Semantic Memory (SM) Beats Episodic Memory (EM): Consistent with our major finding in Section 4, SM alone generally outperforms EM alone, regardless of the model size or family.
DUAL Often Underperforms SM Alone: Surprisingly, integrating both memory types without personalized thinking (DUAL) occasionally yields results lower than or merely comparable to SM alone. This suggests that conflicts between episodic and semantic memories can backfire if not properly mediated.
Model-agnostic Effectiveness: PRIME consistently enhances performance across all base models at different scales, illustrating that our PRIME framework is robust and model-agnostic.
Personalized Thinking is Crucial: By augmenting DUAL with personalized thinking, PRIME achieves superior performance over nearly all variants. This highlights the pivotal role of customized reasoning in improving personalization.
Profile Replacement. To evaluate the extent to which PRIME faithfully ingests and leverages a user's unique history, we perform a controlled "profile-replacement" experiment: for each test query, we substitute the target user's engagement history, both episodic (e.g., the textual profile summary) and semantic (the LoRA weights), with that of (i) the most similar user, (ii) a similar user, (iii) a mid-range user, or (iv) a dissimilar user. We also include a Self condition, retaining the user's own history as a baseline.
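For concreteness, here is a sketch of the replacement protocol; the helper names (`load_profile`, `evaluate`) and the precomputed similarity ranking are our own assumptions.

```python
# Profile-replacement sketch: swap both memory forms (textual summary and
# LoRA weights) with a donor user's before answering each test query.
CONDITIONS = ["self", "most_similar", "similar", "mid_range", "dissimilar"]

def profile_replacement_eval(model, test_queries, similarity_rank):
    scores = {}
    for cond in CONDITIONS:
        total = 0.0
        for q in test_queries:
            donor = q.user_id if cond == "self" else similarity_rank[q.user_id][cond]
            summary, lora = load_profile(donor)  # episodic + semantic memories
            total += evaluate(model, q, summary, lora)
        scores[cond] = total / len(test_queries)
    return scores
```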
We report average performance in Figure 2 and a detailed breakdown (e.g., Hit@k) in Figure A1. For both evaluated models, performance peaks when using the target user's own profile (Self). Replacing it with any other user's profile reduces performance; more interestingly, the drop is steepest when the replacement profile is similar to the original and partially recovers as the replacement profile becomes more divergent.
Why is performance worst with similar profiles? PRIME learns fine-grained, user-specific preferences, effectively a dedicated bias toward which replies persuade that user most when evaluated on the CMV dataset. Notably, owing to data scarcity from the limited number of active users on the CMV forum, two users with superficially similar posting histories may differ sharply in which replies they find compelling. Replacement with similar profiles thus misleads the model far more than any other incorrect profile: PRIME confidently applies the wrong preferences.
8 Conclusion
Inspired by the cognitive dual-memory model, we first systematically study different memory instantiations and then propose PRIME, a unified framework that integrates episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability, yielding more accurate, user-aligned responses and richer reasoning traces. To assess long-context personalization, we introduce the CMV dataset and conduct extensive experiments, which demonstrate the effectiveness of both PRIME and personalized thinking. Finally, our analysis confirms that PRIME shows strong fidelity to each user's unique history.