Agentic Systems and Planning

Can LLMs learn reliably at test time without human oversight?

How can language models adapt to rapidly changing rules and knowledge during inference rather than waiting for retraining? What prevents fully autonomous systems from handling conflicting information?

Note · 2026-05-18 · sourced from Reinforcement Learning

Many real-world domains have rules that change faster than training cycles can keep up with: regulatory compliance, user risk screening, evolving customer policies. Each standard remedy falls short in a specific way. Offline fine-tuning lags reality. In-context learning (ICL) supplies examples but cannot integrate novel rules. Retrieval-augmented generation (RAG) retrieves but cannot reconcile contradictions. Test-time fine-tuning adjusts parameters but cannot easily handle conflict between old and new knowledge.

ARIA (2507.17131) proposes that test-time learning requires three things existing methods do not combine: (1) structured uncertainty self-assessment, (2) a timestamped knowledge base with conflict detection, and (3) human-in-the-loop (HITL) clarification queries when contradictions surface. Each component does work that the others cannot.

Structured self-dialogue is the uncertainty assessor. Rather than thresholding a confidence score, which is notoriously poorly calibrated, ARIA generates a reflective Q&A about its own preliminary judgment: questioning implicit assumptions, recalling related prior experiences, identifying domain-knowledge gaps. This converts confidence assessment into an inspectable reasoning trace that surfaces what kind of uncertainty exists (factual gap, procedural ambiguity, conflict with a prior rule). The structure is what makes the assessment more reliable than a scalar confidence score.
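A minimal sketch of what such an inspectable trace could look like as a data structure. This is not ARIA's implementation: the names `SelfDialogue`, `DialogueTurn`, and `UncertaintyKind` are hypothetical, and the questions and answers are hand-written here, whereas in a real system the model would generate them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class UncertaintyKind(Enum):
    # The three kinds of uncertainty named in the note.
    FACTUAL_GAP = "factual gap"
    PROCEDURAL_AMBIGUITY = "procedural ambiguity"
    RULE_CONFLICT = "conflict with prior rule"


@dataclass
class DialogueTurn:
    question: str
    answer: str
    flagged: Optional[UncertaintyKind] = None  # None: this turn raised no concern


@dataclass
class SelfDialogue:
    """Hypothetical container for a reflective Q&A about one judgment."""
    preliminary_judgment: str
    turns: List[DialogueTurn] = field(default_factory=list)

    def ask(self, question: str, answer: str,
            flagged: Optional[UncertaintyKind] = None) -> None:
        self.turns.append(DialogueTurn(question, answer, flagged))

    def open_uncertainties(self) -> List[UncertaintyKind]:
        # The trace says *what kind* of uncertainty exists, not just how much.
        return [t.flagged for t in self.turns if t.flagged is not None]

    def needs_help(self) -> bool:
        return bool(self.open_uncertainties())


# Illustrative usage with invented content.
trace = SelfDialogue("approve transaction")
trace.ask("What assumption am I making?",
          "That the threshold from the older rule still applies.",
          UncertaintyKind.RULE_CONFLICT)
trace.ask("Have I seen a similar case?",
          "Yes, a comparable case was approved.")
```

The point of the sketch is the return type of `open_uncertainties`: a list of typed concerns that a reviewer can read, instead of a single float compared against a threshold.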

The timestamped knowledge repository is the conflict detector. Each acquired knowledge item is stored with its acquisition timestamp. When new knowledge arrives, ARIA retrieves related entries by semantic matching and compares them against the new information. Inconsistencies are flagged. Older entries are marked as potentially obsolete rather than deleted, which preserves history while signaling which entry is current.
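The store-retrieve-compare-flag cycle can be sketched as follows, under loud assumptions. Word-set overlap stands in for semantic matching, and the contradiction check is a caller-supplied function (a toy negation test in the example), since the note does not say how ARIA implements either. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class KnowledgeItem:
    text: str
    acquired: date
    possibly_obsolete: bool = False  # marked, never deleted


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over word sets: a crude stand-in for semantic matching."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


class KnowledgeRepository:
    def __init__(self, contradicts: Callable[[str, str], bool],
                 threshold: float = 0.3) -> None:
        self.items: List[KnowledgeItem] = []
        self.contradicts = contradicts  # assumed external consistency check
        self.threshold = threshold      # assumed relatedness cutoff

    def add(self, text: str, acquired: date) -> List[KnowledgeItem]:
        """Store a new item; return the existing items it conflicts with."""
        new = KnowledgeItem(text, acquired)
        conflicts = []
        for old in self.items:
            related = token_overlap(old.text, text) >= self.threshold
            if related and self.contradicts(old.text, text):
                # Flag whichever entry has the earlier timestamp.
                older = old if old.acquired <= acquired else new
                older.possibly_obsolete = True
                conflicts.append(old)
        self.items.append(new)
        return conflicts


def toy_contradicts(a: str, b: str) -> bool:
    # Toy check: exactly one of the two statements contains "not".
    return ("not" in a.lower().split()) != ("not" in b.lower().split())


repo = KnowledgeRepository(contradicts=toy_contradicts)
repo.add("transfers above 10000 require manual review", date(2025, 3, 1))
conflicts = repo.add("transfers above 10000 do not require manual review",
                     date(2025, 9, 1))
```

After the second `add`, both entries remain in `repo.items`; only the March entry carries `possibly_obsolete`, and it is returned in `conflicts` so a downstream step can decide what to do with it.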

Active clarification queries are the conflict resolver. When contradictions surface, ARIA neither silently chooses one version nor fails. It generates targeted queries back to human experts: "rule X dated 2025-03 said Y; new guidance dated 2025-09 implies not-Y; please clarify." The human-mediated resolution is the load-bearing step: the system does not attempt to autonomously adjudicate between conflicting rules.
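A small sketch of the query-and-defer step, assuming a conflict between two dated rules has already been detected. The `Rule` type, `clarification_query`, `resolve`, and the `ask_expert` callback are all hypothetical names; the message template only mirrors the example quoted above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    statement: str
    dated: date


def clarification_query(old: Rule, new: Rule) -> str:
    """Build a targeted question that cites both versions and their dates."""
    return (
        f"{old.name} dated {old.dated:%Y-%m} said: {old.statement}. "
        f"{new.name} dated {new.dated:%Y-%m} implies: {new.statement}. "
        "Please clarify which applies."
    )


def resolve(old: Rule, new: Rule, ask_expert: Callable[[str], str]) -> Rule:
    """Defer to the human: return the rule the expert names, never guess."""
    answer = ask_expert(clarification_query(old, new))
    by_name = {old.name: old, new.name: new}
    if answer not in by_name:
        # No autonomous fallback: an unusable answer stays unresolved.
        raise ValueError(f"expert answer {answer!r} names neither rule")
    return by_name[answer]


old_rule = Rule("rule X", "Y", date(2025, 3, 1))
new_rule = Rule("new guidance", "not-Y", date(2025, 9, 1))
query = clarification_query(old_rule, new_rule)

# Stand-in for a real expert channel (ticket queue, chat, review UI).
chosen = resolve(old_rule, new_rule, ask_expert=lambda q: "new guidance")
```

The design choice worth noting is that `resolve` has no default branch that picks the newer rule. Recency is recorded, but the decision belongs to the expert.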

The deeper claim is about what kind of learning AGI needs. Strong-AI fantasies of fully autonomous adaptation collapse on the conflict-resolution problem. When old and new knowledge disagree, no purely autonomous system can reliably pick the right resolution because the choice depends on context outside the system (regulatory authority, organizational priority, expert judgment). ARIA accepts this constraint and designs the human-mediated loop as a first-class component rather than a fallback.

For deployment, this suggests an architectural pattern for any system operating in a rapidly changing domain where the cost of acting on outdated rules is high.


Paper: Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Original note title

test-time learning requires structured self-dialogue plus timestamped knowledge base with conflict-resolution queries