Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Paper · arXiv 2507.17131 · Published July 23, 2025
Reinforcement Learning

Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches—like offline fine-tuning and standard prompting—are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents.

The challenge lies in endowing agents with the capacity for continuous learning and adaptation directly during their deployment (at test time). To bridge this gap, we propose the Adaptive Reflective Interactive Agent (ARIA), a general-purpose framework designed to enable effective LLM learning at test time through structured self-assessment and human-in-the-loop interactions. Specifically, ARIA uses structured self-evaluation to detect gaps or uncertainties in its own knowledge, proactively engages human experts for targeted guidance, and systematically integrates the obtained human knowledge into a carefully managed internal knowledge base, organized by timestamps and equipped with flagging mechanisms for outdated or conflicting information.

Intelligent Guidance Solicitation. Rather than using fixed heuristics or confidence scoring thresholds, ARIA initiates a structured internal question-and-answer self-dialogue. Upon producing a preliminary judgment, ARIA answers reflective questions about the clarity and reliability of its reasoning: identifying implicit assumptions, questioning whether it possesses suitable domain knowledge, and recalling prior related experiences. This approach clearly highlights knowledge gaps and uncertainties, which directly motivates targeted human assistance.
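The self-dialogue loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reflective questions, the `ask_llm` stub, and the `SelfAssessment` structure are all hypothetical names chosen for the example, and a real system would prompt the underlying LLM for each answer rather than use a keyword heuristic.

```python
from dataclasses import dataclass

# Hypothetical reflective questions in the spirit of ARIA's self-dialogue.
REFLECTIVE_QUESTIONS = [
    "Is the reasoning behind the preliminary judgment clear and complete?",
    "Does the judgment avoid implicit assumptions that may not hold?",
    "Do I possess the domain knowledge this case requires?",
    "Have I handled a closely related case before?",
]

def ask_llm(question: str, judgment: str) -> bool:
    """Stub for an LLM call: returns True if the model would answer 'yes'.
    A real implementation would prompt the LLM with the judgment plus the
    reflective question and parse a yes/no response."""
    return "uncertain" not in judgment.lower()

@dataclass
class SelfAssessment:
    judgment: str
    gaps: list  # reflective questions the model could not answer positively

    @property
    def needs_human_guidance(self) -> bool:
        # Any unresolved question is treated as a knowledge gap that
        # triggers a targeted request to a human expert.
        return len(self.gaps) > 0

def assess(judgment: str) -> SelfAssessment:
    """Run the structured self-dialogue over a preliminary judgment."""
    gaps = [q for q in REFLECTIVE_QUESTIONS if not ask_llm(q, judgment)]
    return SelfAssessment(judgment=judgment, gaps=gaps)
```

The key design point, as described above, is that the decision to solicit help falls out of the dialogue itself rather than a fixed confidence threshold.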

Human-Guided Knowledge Adaptation. After identifying knowledge uncertainties, ARIA proactively solicits support and receives guidance—corrections, detailed explanations, or updated rules—from human domain experts. It incorporates these human-provided knowledge inputs into a structured knowledge repository that marks each knowledge item with timestamps. Whenever a new knowledge update occurs, ARIA retrieves related entries by semantic matching in its repository and compares them against the new information. If inconsistencies or contradictions between new and old knowledge appear, ARIA adjusts the status of outdated rules, clearly marking them as potentially obsolete. To efficiently maintain consistency, ARIA also generates active clarification queries back to human experts, resolving detected contradictions and ensuring up-to-date accuracy.
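The repository update flow above can be sketched as follows. This is a simplified illustration under stated assumptions: `related` stands in for semantic matching (a real system would use embedding similarity rather than token overlap), `contradicts` stands in for an LLM-judged consistency check, and all class and function names are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class KnowledgeItem:
    text: str
    timestamp: float
    status: str = "active"  # "active" or "possibly_obsolete"

def related(a: str, b: str) -> bool:
    """Stub for semantic matching: crude token overlap.
    A real system would compare embeddings of the two rules."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1) > 0.3

def contradicts(old: str, new: str) -> bool:
    """Stub for contradiction detection: flags a bare negation flip.
    In practice an LLM would judge whether the rules conflict."""
    return ("not " in old) != ("not " in new)

@dataclass
class KnowledgeRepository:
    items: list = field(default_factory=list)
    clarification_queue: list = field(default_factory=list)

    def update(self, new_text: str) -> None:
        """Insert new expert guidance, flag conflicting older entries as
        potentially obsolete, and queue clarification questions back to
        the human expert."""
        for item in self.items:
            if (item.status == "active"
                    and related(item.text, new_text)
                    and contradicts(item.text, new_text)):
                item.status = "possibly_obsolete"
                self.clarification_queue.append(
                    f"New rule '{new_text}' appears to conflict with "
                    f"'{item.text}'. Which should apply?")
        self.items.append(KnowledgeItem(new_text, time.time()))
```

Timestamping every item and deferring conflict resolution to an explicit clarification query, rather than silently overwriting, is what keeps the repository auditable as rules evolve.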

While we demonstrate ARIA's effectiveness within the context of name screening tasks, it is conceived as a general framework. Its core principles—enabling test-time learning through reflective uncertainty assessment and structured integration of human guidance—are applicable to a wide range of domains. Any task requiring strong, evolving domain-specific knowledge where human expertise is available and valuable for ongoing refinement could benefit from this approach, particularly those operating in rapidly changing environments such as legal document review, complex customer support, or scientific discovery assistance.

For LLMs, common approaches include in-context learning (ICL) or few-shot learning, where the model learns from examples provided within the prompt (Brown et al., 2020; Min et al., 2021; Wang et al., 2023; Hou et al., 2023; He et al., 2025b,c; Sui et al., 2024b; Chen et al., 2025b,c), and retrieval-augmented generation (RAG), which incorporates external knowledge retrieved based on the input (Lewis et al., 2020; Dong et al., 2022; He et al., 2024). Other methods involve test-time fine-tuning, adjusting model parameters specifically for each incoming prompt (Hübotter et al., 2024; Akyürek et al., 2024; Sui et al., 2025). In the agent context, self-learning agents aim to improve autonomously through environmental interaction (Liu et al., 2025a; Gao et al., 2025; Chen et al., 2025a). While existing methods like ICL, RAG, test-time fine-tuning, and autonomous self-learning offer some adaptability, ARIA distinctively establishes a human-mediated continuous learning loop, focusing on structured knowledge integration, conflict resolution via human clarification, and persistent adaptation of an evolving knowledge base at test time.