Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations

Paper · arXiv 2507.13705 · Published July 18, 2025

Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed to have used additional criteria, such as user or item similarity, diversity, or undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as for standard aggregation strategies.

RQ1. Do LLM-generated group recommendations match those derived from different social choice-based aggregation strategies?

RQ2. Does the group structure (uniform or divergent preferences) affect LLM performance?

RQ3. Do LLMs claim to have followed a specific aggregation procedure when prompted to generate explanations of the group recommendation?

LLMs can be employed for both purposes at once, conflating the recommendation and explanation tasks

2 Related Work

Group Recommender Systems (GRS) extend traditional recommender systems to process the preferences of multiple users, generating a single output suited for the group [22]. Such applications have previously been discussed in contexts such as music [23], restaurants [3] or tourism [5]. To generate single recommendations rooted in a range of individual preferences, recent approaches employed methodologies such as attentive neural networks [4, 12], graph neural networks [40] or reinforcement learning [32].

An accessible procedure to derive a group recommendation from individual preferences is the use of social choice-based aggregation strategies, rooted in Social Choice Theory [15, 20, 21]. These strategies present distinct options to aggregate individual preferences into a group recommendation and have been widely employed as a procedure in their own right [2, 34] or as a baseline for more complex approaches [6, 24, 28]. Since these strategies offer diverging, explainable procedures to generate group recommendations, they are suitable as a comparison to contextualize LLM-generated recommendations (e.g. as used by [33]). These strategies are typically categorized as consensus-based, majority-based, or borderline [31]. In this study, we use strategies from each category. The consensus-based strategy, Additive Utilitarian (ADD), recommends the items with the highest sums of all ratings [31]. The majority-based strategy, Approval Voting (APP), selects the items with the most ratings above a set threshold [31]. Finally, two borderline strategies are included: Least Misery (LMS), which recommends the items with the highest of the lowest ratings, and Most Pleasure (MPL), which selects the items with the overall highest rating [31].
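
As an illustration, the following Python sketch implements the four strategies over a small ratings table. The function and variable names, as well as the approval threshold used for APP, are illustrative assumptions and do not come from the paper.

```python
def aggregate(ratings, strategy, approval_threshold=70, top_k=10):
    """Rank items for a group using a social choice-based aggregation strategy.

    ratings: dict mapping item_id -> list of individual user ratings (0-100).
    strategy: one of "ADD", "APP", "LMS", "MPL".
    approval_threshold: illustrative cut-off for Approval Voting (not specified here).
    Returns the top_k item_ids with the highest aggregated score.
    """
    scores = {}
    for item, user_ratings in ratings.items():
        if strategy == "ADD":      # Additive Utilitarian: sum of all ratings
            scores[item] = sum(user_ratings)
        elif strategy == "APP":    # Approval Voting: count ratings above a threshold
            scores[item] = sum(r >= approval_threshold for r in user_ratings)
        elif strategy == "LMS":    # Least Misery: the lowest rating decides
            scores[item] = min(user_ratings)
        elif strategy == "MPL":    # Most Pleasure: the highest rating decides
            scores[item] = max(user_ratings)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy group of three users rating four items (0-100 scale, as in the prompt).
group_ratings = {
    "item_a": [90, 40, 80],
    "item_b": [70, 75, 72],
    "item_c": [95, 10, 60],
    "item_d": [55, 85, 65],
}
for s in ("ADD", "APP", "LMS", "MPL"):
    print(s, aggregate(group_ratings, s, top_k=2))
```

Note how the same ratings table can produce different top lists per strategy, which is exactly why these strategies serve as contrasting reference points for the LLM output.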

3.2.2 Prompt Construction. We opted for a simple prompt construction. First, we introduced the goal: “You are an expert in making and explaining group recommendations based on the knowledge base provided below.” Second, the prompt described the format of the group scenario: “The information includes users (user_id) and information on items they like (item_x). The rating is a scale from 0 to 100. When referring to items, use item_value.” Afterwards, for each iteration, the group scenario was inserted between tags to separate the prompt from the group table, similar to previous work [35]. Finally, the prompt included output formatting instructions. LLMs were instructed to only return a JSON object containing the ‘recommendation’ (top 10 item list) and ‘explanation’ keys. The full prompt is found in the companion repository.1
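
As a rough illustration of this setup, the sketch below assembles such a prompt and parses the expected JSON response. The quoted goal and format sentences and the ‘recommendation’/‘explanation’ keys follow the paper; the tag name, the closing output instruction, and the helper names are assumptions, since the full prompt is only available in the companion repository.

```python
import json

SYSTEM_GOAL = ("You are an expert in making and explaining group recommendations "
               "based on the knowledge base provided below.")
FORMAT_NOTE = ("The information includes users (user_id) and information on items "
               "they like (item_x). The rating is a scale from 0 to 100. "
               "When referring to items, use item_value.")

def build_prompt(group_table: str) -> str:
    # The group scenario is wrapped in tags to separate the prompt from the table;
    # the tag name and the final output instruction are illustrative, not the exact prompt.
    return (
        f"{SYSTEM_GOAL}\n{FORMAT_NOTE}\n"
        f"<group_scenario>\n{group_table}\n</group_scenario>\n"
        "Return only a JSON object with the keys 'recommendation' "
        "(a list of the top 10 item_values) and 'explanation'."
    )

def parse_response(raw: str):
    """Parse the model output, expecting the two keys described in the paper."""
    data = json.loads(raw)
    return data["recommendation"], data["explanation"]
```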

We prompted for “an explanation and example of your recommendation procedure, which someone with no knowledge of recommender systems could understand”. We categorized these LLM-generated explanations according to fixed labels and rules. Given the formulaic nature of these texts, we used a rule-based categorization approach, matching the explanations to categories describing procedures, instead of more complex methods such as embedding similarity.
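
A minimal sketch of such a rule-based categorization is given below, assuming keyword matching over the explanation text. The category labels loosely follow those discussed in the paper (e.g. averaging, similarity, diversity, undefined popularity), but the keyword lists and matching rules are illustrative assumptions rather than the actual rules used.

```python
# Illustrative keyword rules; the actual labels and matching rules used in the
# paper may differ.
CATEGORY_KEYWORDS = {
    "averaging": ["average", "averaged", "mean rating"],
    "summing": ["sum of ratings", "total rating", "added up"],
    "user/item similarity": ["similar users", "similar items", "similarity"],
    "diversity": ["diverse", "diversity", "variety"],
    "undefined popularity": ["popular", "popularity", "well-liked"],
}

def categorize_explanation(text: str) -> list[str]:
    """Return every category whose keywords appear in the explanation."""
    lowered = text.lower()
    matched = [cat for cat, keywords in CATEGORY_KEYWORDS.items()
               if any(kw in lowered for kw in keywords)]
    return matched or ["uncategorized"]

print(categorize_explanation(
    "I averaged the ratings and made sure the list stays diverse."))
# -> ['averaging', 'diversity']
```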

Our findings have important implications for the implementation of LLMs in GRS. All in all, LLM-generated recommendations tended to resemble ADD aggregation, while LLM-generated explanations claimed to average ratings. However, we found important differences across LLMs, both in resemblance to commonly used aggregation procedures and in explanations. For example, explanations generated by Llama regularly mentioned user similarity, while those generated by Mistral and Phi often explicitly stated diversity in the recommendation list. These results indicate that, as opposed to applying social choice-based strategies, LLMs tend to combine multiple approaches to derive a group recommendation. While the output might not resemble that derived by applying a singular aggregation strategy, it might still be accepted by the group.

Another implication of our work concerns the impact of the number of rated items on both recommendations and explanations. Unsurprisingly, NDCG@10 scores decreased across the board when the number of items increased, due to more complex group scenarios. Interestingly, mentions of similarity and diversity increased with item set size. In terms of informativeness of explanations, the presence of “Undefined popularity” decreased when the item set size increased. This instability needs to be acknowledged when using LLMs for GRS.

Prior work has utilized the interactive capabilities of LLMs to address cold-start problems [30, 37] and to create conversational recommender systems [9, 39]. The zero- and few-shot learning capabilities of LLMs have also led to applications geared towards data-sparse tasks, including cross-domain recommendation [16, 25]. Overall, LLMs have shown promise and challenged conventional recommendation methodologies [7, 11]. Despite an increase in LLM-generated recommendations, it is unclear how LLMs derive recommendations when presented with group scenarios containing diverging preferences. Additionally, it remains to be seen whether different LLMs provide different recommendations or follow dissimilar aggregation procedures. Finally, we aim to investigate whether LLM-generated recommendations reflect those derived by social choice-based aggregation strategies, used as ground truth for the evaluation of group recommendations [33, 35].

Table 1 summarizes the average NDCG@10 scores comparing LLM-generated recommendations with those derived by applying social choice-based aggregation strategies. Generally speaking, LLMs outperformed the random baseline across all conditions, indicating that LLMs perform a non-random recommendation procedure (Table 1). However, our results show important distinctions between different LLMs regarding their performance compared to social choice-based aggregation strategies.
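
For reference, the sketch below shows one way NDCG@10 between an LLM-generated top-10 list and a strategy-derived ranking could be computed. The graded-relevance assignment (derived from the strategy's own ranking) and the helper name are assumptions for illustration, not the paper's exact evaluation code.

```python
import math

def ndcg_at_k(llm_list, strategy_list, k=10):
    """NDCG@k of the LLM list against relevance implied by the strategy ranking.

    Items ranked higher by the aggregation strategy receive higher graded
    relevance (an illustrative choice); items outside the strategy's top-k get 0.
    """
    relevance = {item: k - rank for rank, item in enumerate(strategy_list[:k])}
    dcg = sum(relevance.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(llm_list[:k]))
    ideal = sum(rel / math.log2(i + 2)
                for i, rel in enumerate(sorted(relevance.values(), reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the LLM swaps the top two items of the strategy's ranking.
print(ndcg_at_k(["item_b", "item_a", "item_c"], ["item_a", "item_b", "item_c"], k=3))
```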