How does test-time scaling work for individual research agents?

How test-time scaling, agentic architectures, and search strategies combine to improve deep research performance.

Topic Hub · 28 linked notes · 5 sections

Deep Research and Search Scaling

10 notes

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

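The per-turn cap this note asks about can be made concrete with a small sketch. Everything here is hypothetical: `run_search_agent`, `think`, and `search` are stand-in names for an agent loop, not any cited system's API.

```python
# Illustrative sketch: an agent loop that caps reasoning tokens *per turn*
# rather than only enforcing a total budget, so a verbose early turn cannot
# starve later retrievals of compute. All names are hypothetical stand-ins.

def run_search_agent(question, think, search, max_turns=6,
                     per_turn_cap=512, total_budget=2048):
    """Interleave bounded reasoning with retrieval until the budget is spent."""
    spent, context = 0, [question]
    for _ in range(max_turns):
        # Spend at most `per_turn_cap` tokens this turn, and never more
        # than what remains of the total budget.
        allowance = min(per_turn_cap, total_budget - spent)
        if allowance <= 0:
            break
        thought, used = think(context, max_tokens=allowance)
        spent += used
        if thought.get("answer") is not None:   # model is ready to answer
            return thought["answer"], spent
        context.append(search(thought["next_query"]))  # feeds the next turn
    return None, spent
```

The design point is that `allowance` is bounded by both the per-turn cap and the remaining total budget; the contrast the note draws is with loops that only check the total, letting one long think-step crowd out later search turns.
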
Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Does RL training narrow search diversity the same way it does reasoning?

Explores whether the entropy collapse pattern observed in reasoning RL also appears in search agent training. Understanding this helps identify whether diversity loss is a general RL property or domain-specific.

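One way to make the "entropy collapse" question measurable, sketched under the assumption that we log an agent's actions (or search queries) at each checkpoint: track the Shannon entropy of the empirical action distribution over training and watch whether it falls toward zero. The checkpoint data in the usage note is invented for illustration.

```python
# Illustrative sketch: Shannon entropy of an agent's empirical action
# distribution. A uniform distribution has maximal entropy; collapse onto a
# few favored actions (the pattern the note asks about) drives it toward 0.

import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in bits) of a list of observed actions."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Comparing, say, an early checkpoint's log `["search", "browse", "code", "ask"]` (2.0 bits) against a late one's `["search", "search", "search", "browse"]` (about 0.81 bits) would quantify the diversity loss the note hypothesizes.
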
What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Why do deep research agents fabricate scholarly content?

Explores whether AI research agents deliberately invent plausible-sounding academic constructs to meet user demands for depth and comprehensiveness, and what drives this behavior.

Can structured pipelines make LLM novelty assessment reliable?

Explores whether breaking novelty assessment into extraction, retrieval, and comparison stages helps LLMs align with human peer reviewers and produce more rigorous, evidence-based evaluations.

Can specialized agents write better scientific papers than single models?

Multi-agent frameworks decompose writing into specialized subtasks. This explores whether distributed agents maintaining cross-document consistency outperform single-model approaches on manuscript quality and literature synthesis.

Production Deployment

1 note

Agentic Systems

5 notes

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+ examples, suggesting strategic demonstration design may matter far more than scale.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

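The peer-scoring-and-deactivation loop the note describes can be sketched in a few lines. This is a hypothetical scheme, not any cited framework's mechanism: agents rate each other's contributions in [0, 1], and members whose mean rating falls below a threshold are dropped from the active set.

```python
# Illustrative sketch (hypothetical scheme): teammates rate each other's
# contributions during problem-solving; members whose mean peer score falls
# below a threshold are deactivated for subsequent rounds.

from collections import defaultdict

class PeerScoredTeam:
    def __init__(self, members, threshold=0.5):
        self.active = set(members)
        self.threshold = threshold
        self.scores = defaultdict(list)   # member -> peer ratings in [0, 1]

    def rate(self, rater, target, score):
        # Only active members rate, only active members are rated,
        # and self-ratings are ignored.
        if rater in self.active and target in self.active and rater != target:
            self.scores[target].append(score)

    def prune(self, min_ratings=2):
        """Deactivate members whose mean rating is below the threshold."""
        for member in list(self.active):
            ratings = self.scores[member]
            if len(ratings) >= min_ratings and \
                    sum(ratings) / len(ratings) < self.threshold:
                self.active.discard(member)
        return self.active
```

The `min_ratings` guard is one answer to an obvious failure mode: an agent should not be removed on the strength of a single hostile rating.
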
Can agents learn continuously without forgetting old skills?

Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs rather than parameter updates solves catastrophic forgetting.

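The core idea here — skills stored as retrievable code programs rather than weight updates — can be sketched as an append-only library. This is a minimal illustration, with naive keyword matching standing in for the embedding-based retrieval a real system would use; all names are invented.

```python
# Illustrative sketch: externalizing learned behaviors as retrievable code
# programs. New skills are appended to a library rather than written into
# model weights, so acquiring a skill cannot overwrite an earlier one --
# the mechanism the note proposes against catastrophic forgetting.

class SkillLibrary:
    def __init__(self):
        self.skills = {}   # description -> source code of a reusable program

    def add(self, description, code):
        self.skills[description] = code   # append-only: no retraining step

    def retrieve(self, task, k=1):
        """Return the k skills whose descriptions best overlap the task.

        Naive keyword overlap; a real system would use embedding similarity.
        """
        words = set(task.lower().split())
        ranked = sorted(self.skills.items(),
                        key=lambda kv: len(words & set(kv[0].lower().split())),
                        reverse=True)
        return [code for _, code in ranked[:k]]
```

Because old programs are never modified, "forgetting" can only happen at retrieval time, which shifts the research question from parameter interference to search quality over the library.
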
How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Knowledge Graph Reasoning and Training Data Synthesis

5 notes

Why do reasoning systems keep discovering new connections?

Explores whether agentic graph reasoning systems maintain a special balance between semantic diversity and structural organization that enables continuous discovery of novel conceptual relationships.

Can externalizing reasoning into knowledge graphs help smaller models compete?

Can structuring LLM reasoning as explicit knowledge graph triples enable smaller, cheaper models to solve complex tasks more effectively? This matters because it could make advanced reasoning accessible without scaling model size.

Can knowledge graphs generate training data for search agents?

Explores whether synthesizing questions from knowledge graph random walks with entity blurring can create the hard-to-find training data needed to teach deep search agents to reason and search effectively.

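The recipe this note names — random walks plus entity blurring — can be sketched concretely. The toy graph, the blur dictionary, and the question template below are all invented for illustration; the point is only the shape of the pipeline: walk the graph, phrase the path as a multi-hop question, and replace the named anchor entity with a description so the answer must be searched for rather than recalled.

```python
# Illustrative sketch: synthesize a multi-hop question by random-walking a
# toy knowledge graph, then "blur" the starting entity into a relational
# description. Graph, blur text, and template are invented for illustration.

import random

GRAPH = {  # (head, relation) -> tail
    ("Ada Lovelace", "collaborated_with"): "Charles Babbage",
    ("Charles Babbage", "designed"): "Analytical Engine",
}
BLUR = {"Ada Lovelace": "the mathematician who wrote the first published algorithm"}

def synthesize_question(start, hops=2, rng=random):
    entity, path = start, []
    for _ in range(hops):
        edges = [(r, t) for (h, r), t in GRAPH.items() if h == entity]
        if not edges:
            break
        relation, entity = rng.choice(edges)     # random-walk step
        path.append(relation.replace("_", " "))
    anchor = BLUR.get(start, start)              # entity blurring step
    question = (f"Starting from {anchor}, follow "
                f"'{' then '.join(path)}'. Which entity do you reach?")
    return question, entity   # `entity` is the gold answer
```

Blurring is what makes the data "hard to find": the walk endpoint is still a verifiable gold answer, but the question no longer contains the anchor name a model could pattern-match on.
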
Can symbolic rules from knowledge graphs guide complex reasoning?

Can deriving symbolic rules directly from knowledge graph structure help align natural language questions with structured reasoning paths? This explores whether explicit structural patterns outperform semantic similarity for multi-hop inference.

Can iterative revision cycles match how humans actually write?

Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?
