What breaks when specialized AI models reach real users?

How domain-specialized AI models are deployed, routed, and experienced in production workflows, and where they fail.

Topic Hub · 40 linked notes · 11 sections

Domain-Specific Failure Modes

3 notes

Why do language models struggle with historical legal cases?

Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures, and whether the underlying competence is genuinely lost or merely masked by surface-level patterns.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

User Simulation and Evaluation

1 note

Personalized Alignment and Domain Adaptation

4 notes

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

How do personalization granularity levels trade precision against scalability?

LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.

Do user outputs outperform inputs for LLM personalization?

Does a user's history of outputs (responses, endorsed content) matter more for personalization than their input queries? This explores what actually drives effective personalization in language models.

Can personas evolve in real time to match what users actually want?

Explores whether a persona that bridges memory and action can adapt during conversations by simulating interactions and optimizing against user feedback, without retraining the underlying model.

Agentic Deployment and Adaptation

5 notes

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+ samples, suggesting strategic demonstration design may matter far more than scale.

Can API calls outperform UI navigation for agent task completion?

Can agents work faster and more accurately by calling APIs directly instead of clicking through user interfaces? This explores whether changing how agents interact with applications solves latency and error problems that plague current LLM-based systems.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. This matters because it challenges the default conversational paradigm in multi-agent system design.

Routing and Model Selection

3 notes

Can routers select the right model before generation happens?

Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.

What decisions must multi-agent routing systems optimize simultaneously?

Standard LLM routing only picks which model to use. But multi-agent systems involve four interdependent choices: topology, agent count, role assignment, and per-agent model selection. Does optimizing all four together actually improve performance?

Can routing beat building one better model?

Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? This probes whether performance gains come from improving models or from selecting among them.

Production Agentic Workflows

3 notes

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Why do protocol-based tool systems fail in production agentic workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task, such as verifiability, reversibility, and subjectivity, determine whether an agent can safely and effectively handle it?

Generative AI Interaction Design

4 notes

How should users control systems with unpredictable outputs?

When generative AI produces different outputs from identical inputs, how do interaction design principles help users maintain control and develop effective mental models for stochastic systems?

Can language models discover what users actually want from activity logs?

Users pursue month-long interest journeys that transcend individual item clicks. Can LLMs extract these persistent goals from behavioral patterns, and does this change how we should think about personalization?

Do generated interfaces outperform text-based chat for most tasks?

Explores whether LLMs should create interactive UIs instead of text responses, and under what conditions users prefer dynamic interfaces to traditional conversational chat.

When should human-agent systems ask for human help?

Explores the timing problem in collaborative AI systems: since there's no objective metric for optimal interruption, how can we design deferral mechanisms that know when to involve humans without constant disruption or silent failures?

Deployment Failure Modes

2 notes

Does better summary writing actually increase user engagement?

When AI systems generate more informative push notifications, do users engage more? This explores whether informativeness and engagement always align in real product contexts.

Why do LLM meeting summaries fail to help individuals?

Current LLM summarization treats all meeting participants the same, but organizational contexts require personalized recaps. What barriers prevent systems from learning what matters to each person?

Psychology-AI Interdisciplinary Gaps

2 notes

Why do AI researchers cite only narrow psychology pathways?

LLM research engages psychology through surprisingly narrow citation routes, dominated by CBT, stigma theory, and the DSM. This note explores which psychology domains are being overlooked and what risks that creates.

Can we control personality in language models without prompting?

Can lightweight adapter modules enable continuous, fine-grained control over psychological traits in transformer outputs independent of prompt engineering? This explores whether architecture-level personality modification outperforms prompt-based approaches.

Pass 3 Additions (2026-05-03)

5 notes

What architectural choices actually improve recommender system performance?

This exploration examines which design patterns and model structures consistently outperform alternatives in recommender systems. Understanding what works in practice matters because academic benchmarks often miss real-world constraints like latency and cold-start problems.

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

How can LLM agents handle huge candidate lists without breaking?

ReAct agents fail when retrieval tools return hundreds of items that overflow prompts. What architectural changes let LLMs work effectively with large candidate sets in recommendation systems?
