INQUIRING LINE

Do dynamic environments enable different kinds of agent-environment coevolution?

This explores whether agents that interact with changing, responsive environments develop fundamentally different kinds of learning and adaptation from agents trained on fixed data, and what new behaviors that coevolution produces.


The corpus suggests the answer is yes, and the sharpest way to see it is to look at what is lost when the environment is *static*. Agents trained only on frozen expert demonstrations never interact with anything. As a result, they can't learn from their own failures or generalize past what the curators already imagined. Their ceiling is someone else's foresight, not their own capacity (Can agents learn beyond what their training data shows?). Coevolution requires a partner that pushes back, and a dataset doesn't push back.

Once the environment does respond, surprisingly varied kinds of adaptation emerge, often without touching the model's weights. Agents can store verbal self-diagnoses in episodic memory and improve across episodes purely from a binary success/failure signal. The unambiguous environmental feedback is what keeps them from rationalizing their mistakes (Can agents learn from failure without updating their weights?). They can build externalized, composable skill libraries that grow as environmental feedback refines each skill, while an automatic curriculum keeps pushing exploration outward (Can agents learn new skills without forgetting old ones?). And they can formalize the entire learning process as memory operations rather than parameter updates, achieving continual adaptation through case, subtask, and tool memory (Can agents learn continuously from experience without updating weights?). These are genuinely *different kinds* of coevolution. The underlying capacity is the same, but how the environment's feedback is captured determines whether learning shows up as reflection, skill accretion, or memory.
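A minimal sketch of the reflect-and-retry loop described above. Everything here is hypothetical and is not Reflexion's actual code. `act` and `reflect` are deterministic stand-ins for LLM calls, and the task is a toy in which the environment returns only success or failure. The sketch shows one property: performance improves across trials even though no parameter is ever updated. The only thing that changes is the episodic list of verbatim reflections.

```python
from typing import Callable, Optional


def reflect(attempt: str) -> str:
    # Hypothetical stand-in for an LLM writing a verbal self-diagnosis.
    return f"Attempt '{attempt}' failed; try a different approach."


def act(candidates: list[str], reflections: list[str]) -> str:
    # Hypothetical stand-in for an LLM policy conditioned on past reflections:
    # pick the first candidate that no stored reflection has diagnosed as a failure.
    for c in candidates:
        if not any(f"'{c}'" in r for r in reflections):
            return c
    return candidates[-1]


def reflexion_loop(
    candidates: list[str],
    env: Callable[[str], bool],
    max_trials: int = 5,
) -> tuple[Optional[int], Optional[str], list[str]]:
    reflections: list[str] = []  # episodic memory, stored verbatim (uncompressed)
    for trial in range(1, max_trials + 1):
        attempt = act(candidates, reflections)
        if env(attempt):  # binary success/failure signal from the environment
            return trial, attempt, reflections
        reflections.append(reflect(attempt))  # learn by writing, not by gradient
    return None, None, reflections


if __name__ == "__main__":
    trial, attempt, memory = reflexion_loop(
        ["plan_a", "plan_b", "plan_c"], env=lambda a: a == "plan_c"
    )
    print(trial, attempt, len(memory))  # 3 plan_c 2
```

Keeping the reflections uncompressed mirrors the point in the source note that compression would erode their usefulness. In a real system, `act` would be a prompt that includes those reflections.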

The strangest result here is that environments shape agents even when nobody designed them to. A mathematical proof shows that path-following RL agents will *unintentionally* use spatial features of their environment as external memory. Environmental artifacts reduce the information an agent needs to represent its history internally, so memory-like behavior falls out of plain reward optimization with no memory objective at all (Do RL agents accidentally use environments as memory?). That is coevolution in its purest form: the environment becomes part of the agent's cognition without permission.
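To make the intuition concrete, here is a toy illustration. It is not the cited proof. The corridor task, the `policy` function, and its observation format are all assumptions, and unlike the agents in the cited result, this policy is hand-written rather than learned. The policy is memoryless because it sees only the current cell. The task is to walk to the far end of a corridor and come back. In the interior, the outbound and return trips look identical, so a memoryless agent cannot tell them apart unless the marks it leaves behind persist in the environment. When they do, the trace on the floor carries the one bit of history the agent would otherwise have to store internally.

```python
def policy(marked: bool, at_left_end: bool, at_right_end: bool) -> int:
    """Memoryless policy: returns a step (-1, 0, +1) from the current observation only.

    The agent always tries to mark the cell it is leaving; whether the mark
    persists is decided by the environment (see run_episode).
    """
    if at_left_end and marked:
        return 0   # back at an already-visited start cell: stop
    if at_right_end:
        return -1  # reached the far end: turn around
    if not marked:
        return +1  # fresh cell: must be on the outbound trip
    return -1      # marked cell: must be on the way back


def run_episode(n: int, env_keeps_marks: bool, max_steps: int = 100) -> bool:
    """Return True if the agent reaches the far end and then stops at the start."""
    marks = [False] * n  # environmental state the agent writes to and reads from
    pos = 0
    reached_far_end = False  # evaluation bookkeeping only; the policy never sees it
    for _ in range(max_steps):
        step = policy(marks[pos], pos == 0, pos == n - 1)
        if step == 0:
            return reached_far_end
        if env_keeps_marks:
            marks[pos] = True  # trace left on the cell being vacated
        pos += step
        reached_far_end = reached_far_end or pos == n - 1
    return False  # ran out of steps without finishing


print(run_episode(6, env_keeps_marks=True))   # True
print(run_episode(6, env_keeps_marks=False))  # False: oscillates near the far end
```

The policy is identical in both runs. Only the environment's willingness to hold a trace differs, and that alone decides whether the memoryless agent can solve a task that otherwise requires memory.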

When the "environment" is other agents, the coevolution changes character again. Training against diverse co-players induces cooperation not from hardcoded rules but from mutual vulnerability. Each agent's exposure to exploitation creates pressure that resolves into in-context best-response strategies (Can agents learn cooperation by adapting to diverse partners?). In population settings, the *design* of the agents in the environment, not just their presence, decides collective outcomes. Cooperative bots can break a frozen selfish equilibrium by moving randomly to separate defectors from cooperative clusters, while defective bots proportionally weaken the network's cohesion (Can cooperative bots escape frozen selfish populations?). Notably, agent-to-agent coevolution is uneven. Large-scale studies find that agents shift their *actions* dramatically when aware of peers but don't converge on shared *language or ideas*. This suggests the environment reshapes behavior more readily than belief (Do AI agents actually socialize with each other?).

There's a tension worth carrying away: dynamic environments enable richer coevolution, but the dominant training method can quietly undo it. RL training collapses behavioral diversity in search agents through the same entropy-collapse mechanism seen in reasoning, as policies converge on narrow reward-maximizing strategies. Supervised fine-tuning on diverse demonstrations, by contrast, preserves exploration breadth (Does reinforcement learning squeeze exploration diversity in search agents?). The same lesson echoes in inference-time methods. There, an island model that sustains population diversity is exactly what lets evolutionary search avoid the premature convergence of single-trajectory refinement (Can evolutionary search beat sampling and revision at inference time?). So dynamic environments open the door to many kinds of coevolution, but whether an agent walks through it depends on whether the training or search process protects the diversity that coevolution feeds on.
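The diversity argument can be sketched as a toy island-model genetic algorithm. This is not Mind Evolution itself, and several pieces are arbitrary assumptions:

- Bit-flip mutation and one-point crossover stand in for LLM-generated mutations and crossovers.
- The objective is OneMax (count the 1 bits).
- The population sizes and migration schedule are arbitrary.

Shannon entropy over the genomes in the pooled population serves as a simple diversity measure, loosely analogous to the policy entropy that collapses under RL. Islands evolve independently and exchange only their best individual every few generations, which is the mechanism meant to slow convergence onto a single genome.

```python
import math
import random
from collections import Counter

GENOME_LEN = 20
Genome = tuple[int, ...]


def fitness(g: Genome) -> int:
    return sum(g)  # toy objective (OneMax): number of 1 bits


def mutate(g: Genome, rng: random.Random, rate: float = 0.05) -> Genome:
    return tuple(b ^ 1 if rng.random() < rate else b for b in g)


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    cut = rng.randrange(1, len(a))  # one-point crossover
    return a[:cut] + b[cut:]


def evolve(pop: list[Genome], rng: random.Random) -> list[Genome]:
    def pick() -> Genome:  # binary tournament selection
        x, y = rng.sample(pop, 2)
        return x if fitness(x) >= fitness(y) else y

    return [mutate(crossover(pick(), pick(), rng), rng) for _ in pop]


def entropy_bits(pop: list[Genome]) -> float:
    # Shannon entropy of the genome distribution; 0.0 means full convergence.
    n = len(pop)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(pop).values())


def island_model(n_islands=4, size=10, generations=30, migrate_every=5, seed=0):
    rng = random.Random(seed)
    islands = [
        [tuple(rng.randint(0, 1) for _ in range(GENOME_LEN)) for _ in range(size)]
        for _ in range(n_islands)
    ]
    for gen in range(1, generations + 1):
        islands = [evolve(pop, rng) for pop in islands]  # islands evolve independently
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the next island's worst.
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(range(size), key=lambda j: fitness(pop[j]))
                pop[worst] = bests[(i - 1) % n_islands]
    everyone = [g for pop in islands for g in pop]
    return max(everyone, key=fitness), entropy_bits(everyone)


best, diversity = island_model()
print(fitness(best), round(diversity, 2))
```

One way to probe the design choice is to compare the reported entropy against `island_model(n_islands=1, size=40)`, a single pooled population of the same total size. On a problem this easy the gap may be small, so any single run is anecdotal rather than evidence.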


Sources (10 notes)

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Do RL agents accidentally use environments as memory?

A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Can agents learn cooperation by adapting to diverse partners?

Sequence model agents trained against diverse co-players develop in-context best-response strategies that naturally resolve into cooperation. Mutual vulnerability to exploitation creates pressure that drives cooperative mutual adaptation without hardcoded assumptions or timescale separation.

Can cooperative bots escape frozen selfish populations?

Network simulations show cooperative bots escape selfish equilibria by using random movement to separate defectors from cooperative clusters, enabling cooperation to spread. However, defective bots proportionally weaken cohesion, showing that bot behavior design, not mere presence, determines collective outcomes.

Do AI agents actually socialize with each other?

Large-scale studies reveal agents don't align their language or ideas through interaction, but do dramatically change their actions when aware of peer presence. The difference hinges on how models process context versus update learned distributions.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.
