Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation

Paper · arXiv 2507.13525 · Published July 17, 2025

Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, those that consider background knowledge, and those that make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, prompting styles commonly used in natural language processing, such as step-by-step reasoning, and the use of reasoning models often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost.

Recommender systems are widely used in services such as online advertising, e-commerce, and video streaming. Traditional approaches, such as collaborative filtering, predict user behavior by learning patterns from many users’ histories [7, 18, 25]. These methods work well when there is a large amount of training data. However, their performance decreases in cold-start situations [54, 66, 67] where the system has not seen the target items or users before, and also in cross-domain situations [22, 36, 62] where the system is applied to a different domain from its original training domain, such as recommending books after learning from movie data.

Prompt engineering is also used in recommendation, although it does not always lead to better accuracy. For example, techniques developed in natural language processing (NLP), such as step-by-step reasoning [23] and rephrasing [12], have been effective in tasks like question answering and summarization. Recommendation tasks, however, require different types of reasoning that emphasize the relationship between users and items. Several studies have examined prompts in both general NLP [42, 47] and recommendation [9, 15, 19, 20, 31, 48, 49, 57, 64]. Research on recommendation has mostly focused on prompt types proposed in the RecSys field; it often includes few techniques from NLP and covers a limited range of datasets or LLMs. As a result, the conclusions from these studies may not generalize well to broader recommendation settings.

To address these gaps, we conduct a broad evaluation of prompts for LLM-based recommendation. Our study compares 23 prompt types across 8 real-world datasets and 12 LLMs, a scale much larger than that of previous studies.
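The transcripts that follow show model outputs ending in a bracketed ranking such as `[4, 9, 8, 6, 1, 3, 2, 7, 10, 5]`. As a minimal sketch of how such an output might be parsed and sanity-checked in an evaluation loop (the paper's actual parsing code is not shown in this excerpt; `parse_ranking` is a hypothetical helper):

```python
import re

def parse_ranking(text: str, n_candidates: int) -> list[int]:
    """Extract the last bracketed integer list from an LLM response and
    check that it is a permutation of candidate indices 1..n_candidates."""
    matches = re.findall(r"\[([\d,\s]+)\]", text)
    if not matches:
        raise ValueError("no bracketed ranking found in response")
    ranking = [int(x) for x in matches[-1].split(",")]
    if sorted(ranking) != list(range(1, n_candidates + 1)):
        raise ValueError(f"not a permutation of 1..{n_candidates}: {ranking}")
    return ranking

response = "### Final Output:[4, 9, 8, 6, 1, 3, 2, 7, 10, 5]"
print(parse_ranking(response, 10))  # → [4, 9, 8, 6, 1, 3, 2, 7, 10, 5]
```

Taking the last bracketed list (rather than the first) matters because, as the examples below show, models often emit intermediate numbered lists before the final answer.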

𝑡sys = (default_sys_inst)

𝑡𝑢1 = # Requirements: you must rank candidate products that will be provided below to the target user for recommendation. # Observation: {1: {'title': 'The Bourne Legacy Steelbook', 'categories': 'Blu-ray, Movies', 'description': "NOTICE: Polish Release, cover may ...", 'rating': 5.0, 'review': 'awesome movie'}, 2: (omitted)}
(candidate_info)
(preamble), rephrase and expand it to help you do better answering. Maintain all information in the original question.

𝑡𝑎1 = The task at hand is to evaluate and rank a list of candidate products based on their alignment with the preferences of a target user. The user has previously shown a strong interest in action-packed films, particularly those in the Bourne series, as well as comedies with high ratings. The ranking should reflect the following considerations: 1. The presence of action elements, particularly in the context of the Bourne series. 2. (omitted)
Now, let's proceed to rank the candidate products based on these criteria. (omitted)
Based on the analysis of these products, the final ranking will be provided in the specified format. Now, let's rank the products. ### Ranked List: 1. The Bourne Identity (Action) - Strong alignment with the user's interest in the Bourne series. (omitted) 10. **Cooking French, Vol. 2** (Special Interests) - Least relevant to the user's interests. ### Final Output: [4, 9, 8, 6, 1, 3, 2, 7, 10, 5]

𝑡𝑢2 = (final_inst)

𝑡𝑎2 = [4, 9, 8, 6, 1, 3, 2, 7, 10, 5]
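The turn structure above (𝑡sys, 𝑡𝑢1, 𝑡𝑎1, 𝑡𝑢2, 𝑡𝑎2) maps directly onto the chat-message format used by most LLM APIs. A sketch of how the two-turn rephrase-and-respond conversation could be assembled, with placeholder strings standing in for the paper's actual instruction templates:

```python
def build_rephrase_conversation(system_inst, task_prompt, rephrased, final_inst):
    """Two-turn 'rephrase and expand' conversation: the model first restates
    the task in its own words (t_a1), then answers the final instruction."""
    return [
        {"role": "system", "content": system_inst},    # t_sys
        {"role": "user", "content": task_prompt},      # t_u1: task + rephrase request
        {"role": "assistant", "content": rephrased},   # t_a1: model's restatement
        {"role": "user", "content": final_inst},       # t_u2: ask for the final ranking
    ]

msgs = build_rephrase_conversation(
    "(default_sys_inst)",
    "(candidate_info) (preamble), rephrase and expand it ...",
    "The task at hand is to evaluate and rank ...",
    "(final_inst)",
)
print([m["role"] for m in msgs])  # → ['system', 'user', 'assistant', 'user']
```

Feeding the model's own restatement back as an assistant turn is what makes the second request conditioned on the expanded task description rather than only on the raw prompt.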

𝑡sys = (default_system_inst)


𝑡𝑢1 = (user_info + candidate_info)

Our final goal is to provide an answer to the following problem. Before tackling this issue, rephrase and expand it to help you do better answering. Maintain all information in the original question.


𝑡𝑎1 = The task at hand is to evaluate and rank a list of candidate products based on their alignment with the preferences of a target user. The user has previously shown a strong interest in action-packed films, particularly those in the Bourne series, as well as comedies with high ratings. The goal is to identify which of the new candidate products best match these interests and rank them accordingly.

The candidate products provided include a mix of genres, including action, comedy, and drama. Each product has a title, category, and description, which will help determine how closely they align with the user's preferences.

The ranking should reflect the following considerations:

  1. The presence of action elements, particularly in the context of the Bourne series.

  2. High ratings and positive reviews, as the user has previously favored products with a perfect rating of 5.0.

  3. The relevance of the genre to the user's interests, with a focus on action and comedy.

After analyzing the candidate products, I will provide a ranked list indicating the order of preference for the user.

Now, let's proceed to rank the candidate products based on these criteria.

Here are the candidate products to be ranked:

  1. Seven Year Itch VHS (Comedy)

(omitted)

  10. Love Streams (Drama)

Based on the analysis of these products, the final ranking will be provided in the specified format.

Now, let's rank the products.

Ranked List:

  1. The Bourne Identity (Action) - Strong alignment with the user's interest in the Bourne series.

(omitted)

  10. Cooking French, Vol. 2 (Special Interests) - Least relevant to the user's interests.

Final Output:

[4, 9, 8, 6, 1, 3, 2, 7, 10, 5]


𝑡𝑢2 = (final_inst)

𝑡sys = (default_system_inst)


𝑡𝑢1 = (user_info + candidate_info)

Our final goal is to provide an answer to the following problem. Before tackling this issue, please follow this format to proceed step by step with Observation, Thought, and Action:

(Repeat for multiple items if necessary)

Finally, provide your Answer.


𝑡𝑎1 = ### Observation:

The user has shown a strong preference for action-packed movies, particularly those in the Bourne series, as well as high-rated comedies like "Tyler Perry's A Madea Christmas." All items in the user's history have a perfect rating of 5.0 and are well-reviewed, indicating a preference for quality content.

Thought:

The user likely enjoys thrilling action films and comedies with strong narratives and engaging characters. Given this, I will prioritize candidate products that are action-oriented or highly rated comedies.

Action:

I will examine the candidate items one by one to determine their relevance to the user's preferences.

  1. Seven Year Itch VHS: Comedy, but lacks a strong description or notable acclaim.

(omitted)

  10. Love Streams: Drama, not aligned with the user's preferences.

Observation:

Thought:

The top candidates based on the user's preferences are "The Bourne Identity," "Jack Reacher Combo," and "Underworld: Awakening." "The Bourne Identity" is the most aligned with the user's established preferences, followed by "Jack Reacher Combo" and then "Underworld: Awakening."

Answer:

[4, 9, 8]


𝑡𝑢2 = (final_inst)
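Note that the step-by-step run above ends with an incomplete answer ([4, 9, 8] rather than a full permutation of all ten candidates). One simple way such truncated outputs could be handled is to append the unranked candidates in their original order; this is only an illustrative fallback, as the paper's actual handling is not shown in this excerpt:

```python
def complete_ranking(partial: list[int], n_candidates: int) -> list[int]:
    """Append any candidates missing from a truncated ranking, preserving
    the ranked prefix and falling back to original candidate order."""
    seen = set(partial)
    return partial + [i for i in range(1, n_candidates + 1) if i not in seen]

print(complete_ranking([4, 9, 8], 10))  # → [4, 9, 8, 1, 2, 3, 5, 6, 7, 10]
```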

𝑡sys = (default_system_inst)


𝑡𝑢1 = (user_info + candidate_info)

Our final goal is to provide an answer to the ranking item problem. Before tackling this issue, please consider the principles and theories behind the first question.


𝑡𝑎1 = To rank the candidate products based on the user's preferences, we will focus on the following criteria:

  1. Genre and Themes: The user has shown a preference for action and thriller movies, particularly those in the Bourne series, which are characterized by intense action and espionage themes.

  2. Ratings: Higher-rated products (5.0 ratings) are prioritized.

  3. Relevance: Products that are part of the same franchise or have similar themes to the Bourne series will be ranked higher.

Given these criteria, let's analyze the candidate products:

  1. The Bourne Identity (Product 4): This is directly related to the Bourne series, which the user has shown a strong preference for. It has a similar action and espionage theme.

(omitted)

  10. Love Streams (Product 10): This is a drama and does not fit the user's action-oriented preferences.

Based on this analysis, the ranked list of candidate products is as follows:

[4, 8, 9, 2, 3, 6, 7, 1, 5, 10]


𝑡𝑢2 = Thank you! As an expert, what do you think about the above answers? Please provide feedback so that more accurate predictions can be made in the future.


𝑡𝑎2 = The ranking provided is a solid attempt to align the candidate products with the user's preferences based on the information available. However, there are several areas for improvement to enhance the accuracy of future predictions:

  1. User Preference Analysis: While the analysis focused on action and thriller genres, it could benefit from a deeper understanding of the user's overall preferences. For instance, if the user enjoys a mix of genres, including comedy or drama, those products could be ranked higher.

(omitted)

  1. Feedback Loop: Implementing a feedback mechanism where users can indicate their satisfaction with recommendations could help refine future predictions. This would allow for continuous learning and improvement based on user responses.

By incorporating these considerations, future rankings can become more nuanced and tailored to the user's preferences, leading to more accurate and satisfying recommendations.


𝑡𝑢3 = (final_inst)
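The self-feedback pattern above (initial answer → expert critique → revised answer) can be driven by a small loop over the same conversation. A sketch using a stand-in `ask` function in place of a real LLM client; the function name, canned replies, and feedback wording are illustrative, not the paper's exact templates:

```python
def run_self_feedback(ask, system_inst, task_prompt, feedback_inst, final_inst):
    """Three-turn self-feedback dialogue: initial answer, self-critique,
    then a final answer conditioned on both."""
    messages = [{"role": "system", "content": system_inst},
                {"role": "user", "content": task_prompt}]
    for follow_up in (feedback_inst, final_inst):
        reply = ask(messages)                                    # t_a1, t_a2
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": follow_up})  # t_u2, t_u3
    return ask(messages)                                         # t_a3: final ranking

# Stub LLM: returns one canned reply per call, just to show the flow.
replies = iter([
    "[4, 8, 9, 2, 3, 6, 7, 1, 5, 10]",
    "The ranking could weigh ratings and genre mix more carefully.",
    "[4, 9, 8, 6, 1, 3, 2, 7, 10, 5]",
])
final = run_self_feedback(lambda msgs: next(replies),
                          "(default_system_inst)", "(user_info + candidate_info)",
                          "As an expert, please provide feedback ...", "(final_inst)")
print(final)  # → [4, 9, 8, 6, 1, 3, 2, 7, 10, 5]
```

Because the critique turn is appended as an assistant message before the final instruction, the last call sees both the initial ranking and the model's own feedback on it.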