What breaks when specialized AI models reach real users?

How domain-specialized AI models are deployed, routed, and experienced in production workflows, and where they fail.

Topic Hub · 40 linked notes · 11 sections

Domain-Specific Failure Modes

3 notes

Why do language models struggle with historical legal cases?

Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures, and whether the underlying competence is genuinely lost or merely masked by surface-level patterns.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

User Simulation and Evaluation

1 note

Personalized Alignment and Domain Adaptation

4 notes

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

How do personalization granularity levels trade precision against scalability?

LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.

Do user outputs outperform inputs for LLM personalization?

Does a user's history of outputs (responses, endorsed content) matter more for personalization than their input queries? This explores what actually drives effective personalization in language models.

Can personas evolve in real time to match what users actually want?

Explores whether a persona that bridges memory and action can adapt during conversations by simulating interactions and optimizing against user feedback, without retraining the underlying model.

Agentic Deployment and Adaptation

5 notes

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+ samples, suggesting strategic demonstration design may matter far more than scale.

Can API calls outperform UI navigation for agent task completion?

Can agents work faster and more accurately by calling APIs directly instead of clicking through user interfaces? This explores whether changing how agents interact with applications solves latency and error problems that plague current LLM-based systems.

Can AI systems design unique multi-agent workflows per individual query?

Explores whether meta-agents trained with reinforcement learning can automatically generate personalized multi-agent system architectures tailored to individual user queries, rather than applying fixed task-level templates uniformly.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. This matters because it challenges the default conversational paradigm in multi-agent system design.

Routing and Model Selection

3 notes

Can routers select the right model before generation happens?

Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.

What decisions must multi-agent routing systems optimize simultaneously?

Standard LLM routing only picks which model to use. But multi-agent systems involve four interdependent choices: topology, agent count, role assignment, and per-agent model selection. Does optimizing all four together actually improve performance?

Can routing beat building one better model?

Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? This probes whether performance gains come from improving models or from selecting among them.

Production Agentic Workflows

3 notes

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Why do protocol-based tool systems fail in production agentic workflows?

Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task, such as verifiability, reversibility, and subjectivity, determine whether an agent can safely and effectively handle it?

Generative AI Interaction Design

4 notes

How should users control systems with unpredictable outputs?

When generative AI produces different outputs from identical inputs, how do interaction design principles help users maintain control and develop effective mental models for stochastic systems?

Can language models discover what users actually want from activity logs?

Users pursue month-long interest journeys that transcend individual item clicks. Can LLMs extract these persistent goals from behavioral patterns, and does this change how we should think about personalization?

Do generated interfaces outperform text-based chat for most tasks?

Explores whether LLMs should create interactive UIs instead of text responses, and under what conditions users prefer dynamic interfaces to traditional conversational chat.

When should human-agent systems ask for human help?

Explores the timing problem in collaborative AI systems: since there's no objective metric for optimal interruption, how can we design deferral mechanisms that know when to involve humans without constant disruption or silent failures?

Deployment Failure Modes

2 notes

Does better summary writing actually increase user engagement?

When AI systems generate more informative push notifications, do users engage more? This explores whether informativeness and engagement always align in real product contexts.

Why do LLM meeting summaries fail to help individuals?

Current LLM summarization treats all meeting participants the same, but organizational contexts require personalized recaps. What barriers prevent systems from learning what matters to each person?

Psychology-AI Interdisciplinary Gaps

2 notes

Why do AI researchers cite only narrow psychology pathways?

LLM research engages psychology through surprisingly narrow citation routes, dominated by CBT, stigma theory, and the DSM. This note explores which psychology domains are being overlooked and what risks that creates.

Can we control personality in language models without prompting?

Can lightweight adapter modules enable continuous, fine-grained control over psychological traits in transformer outputs independent of prompt engineering? This explores whether architecture-level personality modification outperforms prompt-based approaches.

Pass 3 Additions (2026-05-03)

5 notes

What architectural choices actually improve recommender system performance?

This exploration examines which design patterns and model structures consistently outperform alternatives in recommender systems. Understanding what works in practice matters because academic benchmarks often miss real-world constraints like latency and cold-start problems.

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

How can LLM agents handle huge candidate lists without breaking?

ReAct agents fail when retrieval tools return hundreds of items that overflow prompts. What architectural changes let LLMs work effectively with large candidate sets in recommendation systems?
