INQUIRING LINE

What role does self-learning play in improving agent reasoning without annotation?

This explores how agents can improve their own reasoning by learning from their own experience and internal signals — rather than from human-labeled data or expert demonstrations.


The corpus frames self-learning as a way around a real ceiling: when agents train only on static expert datasets, their competence is capped by what the curators imagined. They never interact with an environment, so they cannot learn from their own failures or generalize past the demonstrated scenarios (Can agents learn beyond what their training data shows?). Self-learning is the escape hatch from that cap, and the corpus describes several distinct ways to build the missing feedback signal in-house.

The most direct trick is to manufacture supervision from consequences. Agents can treat the future states produced by their own actions as the teaching signal, a 'third paradigm' between imitation and reinforcement learning that matches expert-dependent baselines with half the data and gives RL a better warm start (Can agents learn from their own actions without external rewards?). Where there's any verifiable outcome, that signal can be made dense and annotation-free: information-theoretic rewards score each reasoning step's contribution to getting the answer right, matching hand-labeled process feedback without the labeling cost (Can we reward reasoning steps without human annotation?). And where no external reward exists at all, self-play can fabricate a curriculum: a Challenger escalates difficulty, a neutral Judge issues binary verdicts as reward, and both sides co-evolve their skills in natural language. This only works if adversarial pressure is balanced against a safeguard against collapse (Can language models learn skills without human supervision?).
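To make "future states as the teaching signal" concrete, here is a minimal sketch, not the method from the cited work. It assumes a hypothetical six-state chain environment (`env_step`, `N_STATES`, and `ACTIONS` are invented for illustration). The agent records what each of its own actions led to, memorizes those consequences as a world model, and then plans inside that model. No reward and no expert demonstration appears anywhere.

```python
from collections import deque

N_STATES = 6          # hypothetical toy chain: states 0..5
ACTIONS = (-1, +1)    # step left or step right


def env_step(state, action):
    """Toy environment: move along the chain, clipped at both ends."""
    return max(0, min(N_STATES - 1, state + action))


def collect_experience():
    """Try every action in every state and record the outcome.

    No reward or expert label is used: the observed next state is the target.
    """
    return [(s, a, env_step(s, a)) for s in range(N_STATES) for a in ACTIONS]


def fit_world_model(transitions):
    """'Training' here is simply memorizing the observed consequences."""
    return {(s, a): s_next for s, a, s_next in transitions}


def plan(model, start, goal):
    """Breadth-first search through the learned model, not the real environment."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for a in ACTIONS:
            nxt = model.get((state, a))
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [a]))
    return None


model = fit_world_model(collect_experience())
route = plan(model, start=0, goal=4)
print(route)
```

A real agent would replace the lookup table with a learned predictor and the exhaustive sweep with its own exploratory rollouts; the point of the sketch is only that the supervision target comes from the agent's own actions.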

A striking thread is that much of this improvement doesn't require touching the model's weights at all. Reflexion has agents convert a simple success/failure signal into written self-diagnoses stored in episodic memory; the binary feedback prevents the model from rationalizing, and improvement accrues across episodes without any parameter updates (Can agents learn from failure without updating their weights?). AgentFly generalizes this into a memory-augmented decision process where credit assignment and policy improvement happen entirely through memory operations, reaching 87.88% on GAIA validation without modifying the LLM (Can agents learn continuously from experience without updating weights?). VOYAGER stores executable skills in a searchable library and composes complex behaviors from simpler ones, learning continuously while sidestepping the catastrophic forgetting that weight updates cause (Can agents learn new skills without forgetting old ones?). These memory systems even learn to manage themselves: autonomous memory folding compresses interaction history into structured schemas, cutting token overhead while preserving the details needed for reflection (Can agents compress their own memory without losing critical details?).
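The weight-free loop described above can be sketched in a few lines. This is an illustration in the spirit of Reflexion, not its actual implementation: `policy` is a hypothetical stand-in for an LLM call, `environment` is an invented task whose only feedback is pass or fail, and the candidate action names are made up. All learning lives in the memory object.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Reflections carried across episodes; no model weights are touched."""
    reflections: list = field(default_factory=list)
    ruled_out: set = field(default_factory=set)

    def add_failure(self, action):
        self.ruled_out.add(action)
        self.reflections.append(
            f"Tried '{action}' and the task failed; do not repeat it."
        )


def policy(candidates, memory):
    """Stand-in for an LLM call: pick the first candidate memory has not ruled out."""
    for action in candidates:
        if action not in memory.ruled_out:
            return action
    return None


def environment(action):
    """Binary, unambiguous feedback: True on success, False otherwise."""
    return action == "binary_search"


def run(candidates, max_episodes=5):
    memory = EpisodicMemory()
    for episode in range(1, max_episodes + 1):
        action = policy(candidates, memory)
        if action is None:
            break
        if environment(action):
            return episode, action, memory
        memory.add_failure(action)  # the written self-diagnosis is the only update
    return None, None, memory


episodes, action, memory = run(["linear_scan", "hash_lookup", "binary_search"])
print(episodes, action)
for note in memory.reflections:
    print(note)
```

In a real system the reflection would be free text written by the model and fed back into the next prompt; the `ruled_out` set is a simplification so the sketch runs without an LLM.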

Here's the part you might not expect, and it reframes the whole question. A cluster of findings argues that self-learning often isn't teaching new reasoning; it's *unlocking* reasoning the base model already has. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all elicit capability already latent in base-model activations, suggesting post-training selects reasoning rather than creating it (Do base models already contain hidden reasoning ability?). The sharper version of this claim: RL post-training mostly teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains just by routing tokens (Does RL post-training create reasoning or just deploy it?). You can even skip training entirely: modular cognitive tools lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL, purely by structuring how the latent capability gets called (Can modular cognitive tools unlock reasoning without training?).

The honest caveat the corpus adds: self-learning leans on the model's signals about itself, and those signals are shaky. Models can describe their own learned behaviors, but their self-reports are unstable, they shift beliefs under conversational pressure, and users over-trust confident outputs regardless of accuracy (How well do language models understand their own knowledge?). That's why the methods that work best lean on *unambiguous* signals (binary environmental feedback, verifiable outcomes, action consequences) rather than the model's introspective judgment of its own reasoning. Self-learning without annotation works precisely when the feedback comes from the world, not from the model grading itself.
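The contrast between world-grounded feedback and self-grading can be shown with a toy example. Everything here is hypothetical: the two candidate functions, their self-reported confidence values, and the test cases are invented for illustration. The sketch only shows that selecting by a verifiable outcome and selecting by stated confidence can disagree.

```python
def verifiable_reward(candidate_fn, test_cases):
    """Return 1 if the candidate passes every check, else 0.

    The outcome is decided by running the candidate, not by asking it.
    """
    try:
        return int(all(candidate_fn(x) == expected for x, expected in test_cases))
    except Exception:
        return 0


# Hypothetical candidates for "double the input", each with a self-reported confidence.
candidates = [
    {"name": "confident_but_wrong", "fn": lambda x: x * x, "confidence": 0.95},
    {"name": "hesitant_but_right", "fn": lambda x: x * 2, "confidence": 0.40},
]
tests = [(1, 2), (2, 4), (3, 6)]

by_confidence = max(candidates, key=lambda c: c["confidence"])["name"]
by_outcome = max(candidates, key=lambda c: verifiable_reward(c["fn"], tests))["name"]

print("selected by self-report:", by_confidence)
print("selected by verified outcome:", by_outcome)
```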


Sources (12 notes)

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Next inquiring lines