Can LLMs explain recommenders by mimicking their internal states?
Can training a language model to align with both a recommender's outputs and its internal embeddings produce explanations that are both faithful and human-readable? This note asks whether such dual-access interpretation resolves the tension between behavioral accuracy and interpretability.
Conventional explainability for recommenders trains a separate surrogate model to mimic the target's predictions and reads feature importances off the surrogate. This works at the behavioral level — the surrogate predicts what the target predicts — but it never probes the internal mechanism. It is a black-box explanation of a black box.
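A minimal sketch of that conventional surrogate pattern (not RecExplainer), assuming a toy black-box scorer: fit an interpretable model to the target's scores and read importances off the surrogate. The recommender, its hidden weights, and the feature layout are all illustrative.

```python
# Conventional surrogate explanation: mimic the black box's outputs with an
# interpretable model, then read feature importances from the surrogate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def target_recommender_score(user_item_features: np.ndarray) -> np.ndarray:
    """Stand-in for the black-box recommender we want to explain."""
    w = np.array([0.8, -0.3, 0.5, 0.0])   # hidden weights, unknown to the explainer
    return user_item_features @ w

X = rng.normal(size=(1000, 4))             # user/item feature vectors
y = target_recommender_score(X)            # behavioral labels come from the target, not ground truth

surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(surrogate.feature_importances_)      # an "explanation" at the behavioral level only
```

The surrogate can match the target's scores closely and still say nothing about how the target computes them, which is the gap the next three alignment steps address.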
RecExplainer's three-tier alignment scheme bridges this gap. Behavior alignment is the conventional surrogate: feed the LLM user profile text and train it to predict the items the target recommender would suggest. The LLM learns to reproduce target predictions from textual input.
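A hedged sketch of behavior alignment as supervised fine-tuning data: the LLM receives only text and is trained to reproduce the items the target recommender ranks highest. The prompt wording, field names, and toy histories are assumptions, not RecExplainer's exact templates.

```python
# Behavior alignment: build (prompt, completion) pairs where the completion
# is whatever the black-box recommender outputs for that user.
from typing import Dict, List


def behavior_alignment_example(history: List[str], target_top_k: List[str]) -> Dict[str, str]:
    """Build one supervised fine-tuning pair from text alone."""
    prompt = (
        "A user recently interacted with: " + ", ".join(history) + ". "
        "Predict the items the recommender will suggest next."
    )
    completion = ", ".join(target_top_k)   # labels come from the target model, not from real user behavior
    return {"prompt": prompt, "completion": completion}


# Toy usage with illustrative items.
example = behavior_alignment_example(
    history=["wireless mouse", "mechanical keyboard"],
    target_top_k=["USB hub", "monitor arm", "desk mat"],
)
print(example["prompt"])
print(example["completion"])
```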
Intention alignment goes deeper. Instead of giving the LLM only text, it incorporates the target recommender's neural-layer activations (the embeddings of users and items in the target's latent space) into the LLM's prompt. The LLM is fine-tuned to understand these embeddings as a multimodal input — text and recommendation-model embeddings are two modalities. Predictions now leverage the target's internal representation, not just its outputs.
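A minimal sketch of how such embeddings can enter the prompt, assuming a small projection adapter: vectors from the frozen recommender are mapped into the LLM's hidden-state space and spliced in as extra "soft tokens". Dimensions, module names, and the adapter architecture are assumptions; RecExplainer's actual adapter may differ.

```python
# Intention alignment input path: project recommender-space vectors into the
# LLM embedding space and prepend them to the embedded text prompt.
import torch
import torch.nn as nn


class EmbeddingAdapter(nn.Module):
    """Maps recommender-space vectors into the LLM's hidden-state space."""

    def __init__(self, rec_dim: int = 64, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(rec_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, rec_embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(rec_embeddings)   # (batch, n_vectors, llm_dim)


# Toy usage: one user vector and two item vectors from the frozen target
# recommender become three soft tokens ahead of the text embeddings.
adapter = EmbeddingAdapter()
rec_vectors = torch.randn(1, 3, 64)        # user + item embeddings from the target recommender
soft_tokens = adapter(rec_vectors)
text_tokens = torch.randn(1, 20, 4096)     # embeddings of the textual prompt
llm_input = torch.cat([soft_tokens, text_tokens], dim=1)
print(llm_input.shape)                     # torch.Size([1, 23, 4096])
```

During fine-tuning only the adapter (and optionally the LLM) is updated; the target recommender stays frozen so its internal representation is inspected, not altered.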
Hybrid alignment combines both: text and embeddings in the same input. The LLM produces explanations that pair the human-interpretable reasoning supported by the text with the high-fidelity behavior matching provided by the embeddings.
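A hedged sketch of a hybrid-alignment training example: the prompt interleaves text with placeholders that an adapter like the one above fills with projected recommender embeddings, and the completion asks for both the prediction and a natural-language rationale. The placeholder syntax, prompt wording, and toy rationale are assumptions, not the paper's templates.

```python
# Hybrid alignment: one example carries both modalities (text plus embedding
# placeholders) and supervises a prediction followed by an explanation.
from typing import Dict, List


def hybrid_alignment_example(history: List[str], target_top_k: List[str], rationale: str) -> Dict[str, str]:
    prompt = (
        "User embedding: <user_emb>. Candidate item embeddings: <item_embs>. "
        "The user recently interacted with: " + ", ".join(history) + ". "
        "Predict the recommender's next suggestions and explain why they fit this user."
    )
    completion = ", ".join(target_top_k) + ". " + rationale
    return {"prompt": prompt, "completion": completion}


print(hybrid_alignment_example(
    history=["wireless mouse", "mechanical keyboard"],
    target_top_k=["USB hub", "monitor arm"],
    rationale="These extend the user's recent desk-setup purchases.",  # illustrative rationale text
)["completion"])
```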
The general principle: when you need to interpret a black-box model, behavioral mimicry and internal-state inspection are complementary. Each alone is partial — behavioral mimicry misses the mechanism, internal inspection misses the human-readable explanation. Combining them produces explanations that are both faithful to the target and intelligible to users. The pattern generalizes beyond recommendation: any model interpretation problem benefits from this dual access.
Source: Recommenders LLMs
Related concepts in this collection
- Do LLM explanations faithfully describe their recommendation process?
When LLMs recommend items to groups, do their explanations match how they actually made the choice? This matters because users trust explanations to understand AI decision-making.
tension with: RecExplainer tries to align LLM-explainer behavior with the underlying model — exactly the alignment that LLM-as-explainer fails to achieve by default
- Can retrieval enhancement fix explainable recommendations for sparse users?
When users have few historical interactions, embedded recommendation models struggle to generate personalized explanations. Can augmenting sparse histories with retrieved relevant reviews—selected by aspect—overcome this fundamental data limitation?
complements: surrogate-model interpretability and aspect-aware retrieval are alternative answers to the explainable-recommendation problem
- Can attention mechanisms reveal which user taste explains each recommendation?
Single-vector user models collapse diverse tastes into one representation, losing expressiveness. Can weighting multiple personas by item relevance surface the right taste at the right time while making recommendations traceable?
complements: persona-attention explains via the recommender's own structure; RecExplainer trains an external LLM to mimic — different routes to interpretability
- Does processing ease mislead users about their own competence?
When AI generates polished output, do users mistake the fluency of that output as evidence of their own understanding or skill? This matters because it could systematically inflate self-assessment across millions of AI interactions.
tension with: LLM-generated explanations are fluent regardless of fidelity — the trust risk is that surrogate output reads as authoritative even when alignment fails
Original note title
RecExplainer uses an LLM as a surrogate model with three alignment methods (behavior, intention, and hybrid) for recommendation interpretability