How should moderator LLMs decide which speakers to query per topic?

This note explores the design problem behind a 'moderator' LLM in a multi-speaker setting: how it should pick *whom* to call on for a given topic, and what the corpus offers on querying decisions even though it never uses the word 'moderator' directly.

The question is read here as a routing-and-querying decision: given a topic and several possible speakers, how does a moderator LLM choose whom to query, and when? The corpus has no paper labeled 'moderator', but it has surprisingly direct material if you treat the problem as three sub-decisions: who has signal on this topic, when to ask versus infer, and how to attribute what comes back without breaking trust.

The most useful reframe comes from work on *when* an agent should ask at all. Conversation-analysis research formalizes 'insert-expansions' (the clarifying side-questions humans use to scope intent before acting) as a principled trigger for when a tool-enabled model should probe a person rather than silently proceed (see 'When should AI agents ask users instead of just searching?'). For a moderator, this is the core gate: query a speaker when their input would change the answer, not on a fixed schedule. The companion finding, from MCP-Zero, is that the *model itself* is often a better judge of what to fetch than a passive retriever sitting in front of it: letting the model emit structured, iterative requests for tools outperforms single-round semantic matching (see 'Can models decide better than retrievers which tools to use?'). Swap 'tools' for 'speakers' and, by analogy, you get a design principle: let the moderator reason its way to whom to call on, refining across turns, rather than precomputing a similarity score between topic and speaker.
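The gate described above can be sketched in a few lines. This is a minimal illustration, not a method from any of the cited notes: the `PossibleReply` structure, the `should_query` function, and the 0.3 threshold are all hypothetical, and the probabilities are assumed to come from the moderator's own estimate of what a speaker might say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PossibleReply:
    """One reply the moderator thinks a speaker might give (hypothetical structure)."""
    probability: float     # moderator's estimate that the speaker says this
    resulting_answer: str  # answer the moderator would give after hearing it


def should_query(current_answer: str,
                 possible_replies: list[PossibleReply],
                 threshold: float = 0.3) -> bool:
    """Query only if the reply is likely enough to change the answer.

    The threshold is an assumed tuning knob, not a value from the corpus.
    """
    # Total probability of replies that would lead to a different answer.
    change_mass = sum(
        r.probability
        for r in possible_replies
        if r.resulting_answer != current_answer
    )
    return change_mass >= threshold
```

The point of the design is that the decision depends on expected impact, so a speaker who would almost certainly confirm the current answer is not queried, no matter how long it has been since they last spoke.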

But *which* speaker has relevant signal is a retrieval problem in disguise, and the corpus warns there's no single right strategy. RecLLM identifies four distinct retrieval patterns for large-corpus recommenders (dual-encoder dense embedding, direct LLM search, concept-based, and search-API lookup), each suited to different corpus sizes, latency budgets, and training constraints, with hybrids likely to work best in real systems (see 'How should LLM-based recommenders retrieve from massive item corpora?'). A moderator choosing speakers faces the same fork: matching a topic to a speaker by embedding similarity is cheap but shallow, while reasoning over each speaker's history is richer but slower. And what you match *on* matters: personalization research finds that profiles built from people's past *outputs* (what they said and how) match or exceed complete profiles, while profiles built only from their past *inputs* or queries degrade performance (see 'Do user outputs outperform inputs for LLM personalization?'). So a moderator should profile speakers by the style and preferences evident in their prior contributions, not by the questions they asked.
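One way to picture the hybrid is a cheap shortlist followed by a slower rerank, with both stages reading only what speakers previously said. The sketch below is an assumption-laden illustration: bag-of-words cosine similarity stands in for a real embedding model, the `rerank` callable stands in for an LLM reasoning over a speaker's history, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Callable


def bag(text: str) -> Counter:
    """Lowercased token counts; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def shortlist_speakers(topic: str,
                       past_outputs: dict[str, list[str]],
                       k: int = 2) -> list[str]:
    """Cheap stage: rank speakers by similarity between the topic and
    their past outputs (not their past questions)."""
    topic_vec = bag(topic)
    scores = {
        name: cosine(topic_vec, bag(" ".join(outputs)))
        for name, outputs in past_outputs.items()
    }
    # Highest score first; ties broken by name for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


def pick_speaker(topic: str,
                 past_outputs: dict[str, list[str]],
                 rerank: Callable[[str, list[str]], float],
                 k: int = 2) -> str:
    """Slow stage: apply a costlier scorer only to the shortlist."""
    candidates = shortlist_speakers(topic, past_outputs, k)
    return max(candidates, key=lambda n: rerank(topic, past_outputs[n]))
```

The split keeps the expensive reasoning step bounded by `k` rather than by the number of speakers, which is the latency and accuracy tradeoff the paragraph describes.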

Two failure modes lurk here, and this is the part a reader might not expect to care about. First, attribution: the moment a moderator routes and summarizes across speakers, it inherits the same failure that makes LLM meeting summaries untrustworthy. Mis-attributing who said what damages group trust and accountability, and 'globally important' is not the same as 'relevant to this person' (see 'Why do LLM meeting summaries fail to help individuals?'). Querying the right speaker is wasted if the moderator then mis-credits the reply. Second, topic discipline: models reliably drift toward conversational distractors because they're trained on what-to-do but not what-to-ignore, a gap that narrowed significantly after fine-tuning on just 1,080 synthetic dialogues with distractor turns (see 'Why do language models engage with conversational distractors?'). A moderator that can't hold a topic will query speakers about the wrong thing.
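The attribution risk can be made checkable if every summarized claim carries a pointer back to the reply it came from. The following is a sketch under that assumption; the `Reply` and `SummaryItem` structures and the `find_misattributions` helper are hypothetical, and the check only verifies who is credited, not whether the claim paraphrases the reply faithfully.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    reply_id: int
    speaker: str
    text: str


@dataclass(frozen=True)
class SummaryItem:
    claim: str
    attributed_to: str
    source_reply_id: int  # pointer back into the reply log


def find_misattributions(log: list[Reply],
                         summary: list[SummaryItem]) -> list[SummaryItem]:
    """Return summary items whose credited speaker does not match the
    speaker of the reply they cite, or that cite a reply not in the log."""
    by_id = {r.reply_id: r for r in log}
    flagged = []
    for item in summary:
        source = by_id.get(item.source_reply_id)
        if source is None or source.speaker != item.attributed_to:
            flagged.append(item)
    return flagged
```

Running a check like this before a summary is shown means a mis-credited reply is caught mechanically rather than by the person who was misquoted.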

The quietly important takeaway: don't trust the moderator to *hold a position* about who matters. Models conform to the shape of whatever framing is in front of them rather than defending a stable stance (see 'Do LLMs actually hold stable positions or just mirror user arguments?'), so a moderator's judgment of 'who's relevant here' will bend to how the topic was phrased to it. That argues for grounding speaker selection in explicit, auditable signals (contribution history, declared expertise) rather than the model's in-the-moment sense of fit. The decision of whom to query is less a ranking problem than a discipline problem: ask only when it changes the outcome, match on what people actually contributed, attribute carefully, and don't let the framing of the topic quietly rewrite who counts as relevant.
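Grounding selection in explicit signals could look something like the sketch below, where the ranking is computed from stored records and each score comes with the reasons behind it. Everything here is illustrative: the `SpeakerRecord` fields, the `select_with_audit` function, and the weights (2 points for declared expertise, 1 per prior contribution) are assumptions, and the topic is assumed to have already been mapped to canonical tags so that rephrasing it does not change the inputs.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerRecord:
    name: str
    declared_expertise: set[str] = field(default_factory=set)
    contributions_by_tag: dict[str, int] = field(default_factory=dict)


def select_with_audit(topic_tags: set[str],
                      speakers: list[SpeakerRecord]) -> list[tuple[str, int, list[str]]]:
    """Rank speakers from stored signals only, returning (name, score, reasons).

    Speakers with no matching signal are omitted rather than ranked last.
    """
    results = []
    for speaker in speakers:
        score = 0
        reasons = []
        for tag in sorted(topic_tags):  # sorted for a stable audit trail
            if tag in speaker.declared_expertise:
                score += 2  # assumed weight
                reasons.append(f"declared expertise: {tag}")
            count = speaker.contributions_by_tag.get(tag, 0)
            if count:
                score += count  # assumed weight: 1 per contribution
                reasons.append(f"{count} prior contributions on {tag}")
        if score:
            results.append((speaker.name, score, reasons))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Because the function never sees the topic's wording, two differently framed versions of the same question produce the same ranking and the same reasons, which is the stability the model's own sense of fit cannot be trusted to provide.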

Sources (7 notes)

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

How should LLM-based recommenders retrieve from massive item corpora?

RecLLM identifies four retrieval patterns—dual-encoder, direct LLM search, concept-based, and search-API lookup—each optimized for different corpus sizes, latency budgets, and training constraints. Hybrid approaches mixing multiple strategies likely work best for real systems.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Why do LLM meeting summaries fail to help individuals?

A user study of seven participants found three critical failures: systems summarize global importance rather than individual relevance, mis-attributions damage group trust and accountability, and one format cannot serve both quick scanning and detailed reference needs.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
