How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
This explores how an evolutionary loop — scoring candidates by a fitness signal, then having an LLM recombine the better ones — actually works when the solutions are free-form text or plans rather than neatly structured objects, and where that fitness signal is doing the real steering.
This reads the question as: in LLM-driven evolution, the model proposes and mutates candidates, but selection (picking which candidates survive to be recombined) supplies the optimization pressure the LLM lacks on its own. The corpus is unusually pointed on why that division of labor matters. The headline result is Mind Evolution ("Can evolutionary search beat sampling and revision at inference time?"), which uses LLM-generated mutations and crossovers inside a genetic algorithm and solves ~98% of planning tasks, beating Best-of-N and sequential revision. The key isn't smarter mutation; it's that an island model keeps a diverse population alive, so fitness can keep sorting variants instead of collapsing to one trajectory. Fitness-proportional selection, applied to a population the island model keeps diverse, is what prevents the premature convergence that single-path refinement falls into.
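To make the mechanism concrete, here is a minimal sketch of fitness-proportional (roulette-wheel) selection inside an island model. Under this scheme a candidate i is drawn as a parent with probability f_i / sum_j f_j, so weaker candidates still reproduce occasionally rather than being cut outright. Everything named here (`roulette_select`, `evolve_islands`, the toy fitness and recombination functions, the elitism step, and the ring migration schedule) is an illustrative assumption, not Mind Evolution's actual implementation. In a real system `toy_recombine` would be replaced by an LLM call and `toy_fitness` by a task evaluator.

```python
import random
from typing import Callable, List

Candidate = str  # free-form text in a real system; short strings in this toy


def roulette_select(pop: List[Candidate], fitness: Callable[[Candidate], float],
                    k: int, rng: random.Random) -> List[Candidate]:
    """Draw k parents with probability proportional to (non-negative) fitness."""
    weights = [max(fitness(c), 0.0) for c in pop]
    if sum(weights) == 0.0:
        # No signal yet: fall back to uniform sampling.
        return [rng.choice(pop) for _ in range(k)]
    return rng.choices(pop, weights=weights, k=k)


def evolve_islands(islands, fitness, recombine, generations, migrate_every, rng):
    """Evolve several isolated populations with occasional ring migration."""
    islands = [list(pop) for pop in islands]
    for gen in range(1, generations + 1):
        next_islands = []
        for pop in islands:
            children = [max(pop, key=fitness)]  # elitism: keep the island's best
            while len(children) < len(pop):
                a, b = roulette_select(pop, fitness, 2, rng)
                children.append(recombine(a, b, rng))
            next_islands.append(children)
        islands = next_islands
        if gen % migrate_every == 0:
            # Ring migration: each island's worst member is replaced by a copy
            # of the neighbouring island's best.
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
                pop[worst] = bests[i - 1]
    return islands


# --- Toy stand-ins (assumptions, not any real system's operators) ---
TARGET = "plan: a then b then c"
ALPHABET = "abcdefghijklmnopqrstuvwxyz: "


def toy_fitness(c: Candidate) -> float:
    """Count positions matching TARGET; a real evaluator would score a plan."""
    return float(sum(x == y for x, y in zip(c, TARGET)))


def toy_recombine(a: Candidate, b: Candidate, rng: random.Random) -> Candidate:
    """One-point crossover plus one random edit; an LLM call would go here."""
    cut = rng.randrange(1, len(TARGET))
    child = list(a[:cut] + b[cut:])
    child[rng.randrange(len(child))] = rng.choice(ALPHABET)
    return "".join(child)


def random_candidate(rng: random.Random) -> Candidate:
    """Random string of the same length as TARGET."""
    return "".join(rng.choice(ALPHABET) for _ in TARGET)
```

The islands only exchange candidates every `migrate_every` generations, which is one simple way to keep lineages separate long enough for fitness to sort among genuinely different variants.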
Why lean so hard on external selection? Because the corpus suggests LLMs can't run the optimization internally. Models don't actually execute iterative numerical methods; they pattern-match to memorized templates and emit plausible-but-wrong values ("Do large language models actually perform iterative optimization?"), and they plateau around 55–60% constraint satisfaction regardless of scale or 'reasoning' training ("Do larger language models solve constrained optimization better?"). Even RL fine-tuning sharpens template-matching rather than installing a real search procedure ("Do fine-tuned language models actually learn optimization procedures?"). So the evolutionary framing is a workaround: the LLM supplies cheap, varied proposals, and the fitness function supplies the iteration the model can't do in its own latent space. Selection is the optimizer; the LLM is the mutation operator.
Now the twist hiding in the phrase 'unstructured solution spaces.' The corpus pushes back on the idea that selection alone can guide recombination through a formless space. Genesys, a multi-agent LLM system doing genetic programming over neural architectures, found that imposing a structured representation lifted design success from 14% to nearly 100% versus letting the LLM generate freely ("Can AI systems discover better neural architectures than humans?"). The same fitness signal works dramatically better when crossover operates on structured genomes rather than raw text. In other words, 'unstructured' is often the enemy, and part of what makes these systems succeed is quietly adding structure so that recombination produces valid offspring rather than noise.
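The point about valid offspring can be illustrated with a toy comparison. The sketch below is not Genesys's representation; the `SCHEMA`, its slot names, and both crossover functions are hypothetical. It contrasts slot-wise crossover on a structured genome, which is valid by construction, with splicing the same parents as serialized text, which frequently yields something that no longer parses.

```python
import json
import random
from typing import Dict

Genome = Dict[str, str]

# Hypothetical design schema: each slot has a fixed set of legal values.
SCHEMA = {
    "mixer": ["attention", "ssm", "conv"],
    "norm": ["layernorm", "rmsnorm"],
    "activation": ["gelu", "swiglu", "relu"],
}


def is_valid(genome) -> bool:
    """A genome is valid if it has exactly the schema's slots with legal values."""
    return (isinstance(genome, dict)
            and set(genome) == set(SCHEMA)
            and all(genome[k] in SCHEMA[k] for k in SCHEMA))


def structured_crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Uniform crossover: each slot is inherited whole from one parent."""
    return {k: (a if rng.random() < 0.5 else b)[k] for k in SCHEMA}


def text_crossover(a: Genome, b: Genome, rng: random.Random) -> str:
    """Splice the serialized parents at independent character offsets."""
    sa, sb = json.dumps(a), json.dumps(b)
    return sa[:rng.randrange(len(sa))] + sb[rng.randrange(len(sb)):]


def parses_to_valid(text: str) -> bool:
    """True only if the text is JSON that decodes to a valid genome."""
    try:
        return is_valid(json.loads(text))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def valid_rates(n: int, rng: random.Random):
    """Fraction of valid offspring under each crossover, over n trials."""
    a = {"mixer": "attention", "norm": "layernorm", "activation": "gelu"}
    b = {"mixer": "ssm", "norm": "rmsnorm", "activation": "swiglu"}
    structured = sum(is_valid(structured_crossover(a, b, rng)) for _ in range(n))
    textual = sum(parses_to_valid(text_crossover(a, b, rng)) for _ in range(n))
    return structured / n, textual / n
```

Calling `valid_rates(200, random.Random(0))` returns the two validity fractions. Structured crossover cannot produce an invalid child here, so whatever fitness budget is available is spent ranking real candidates instead of discarding malformed ones.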
There's also the question of where fitness comes from at all. Selection needs a scalar to be proportional to, and the corpus says that scalar is the real bottleneck: autonomous optimization only works in domains that supply immediate scalar metrics, modular structure, and fast iteration ("What makes a research domain suitable for autonomous optimization?"). When no clean fitness exists, you can manufacture one. AlphaLLM uses tree search outcomes and critic models to rank solution paths without human labels, letting structure substitute for an annotation oracle ("Can tree search replace human feedback in LLM training?"), and other work has LLMs design the reward-shaping signal itself before optimizing against it ("Can LLMs design reward functions for reinforcement learning?").
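As a much simpler illustration than AlphaLLM or MEDIC, one way to manufacture a scalar when only pass/fail checks exist is to score a candidate by the fraction of constraints it satisfies. This is a sketch under stated assumptions: the `constraint_fitness` function, the trip-planning constraints, and the `floor` parameter are all hypothetical and are not drawn from the cited work. The floor keeps every weight positive so that proportional selection never assigns a candidate zero probability.

```python
from typing import Callable, Dict, List

Plan = Dict[str, float]
Constraint = Callable[[Plan], bool]


def constraint_fitness(plan: Plan, constraints: List[Constraint],
                       floor: float = 0.01) -> float:
    """Fraction of constraints satisfied, floored so no weight is zero."""
    if not constraints:
        raise ValueError("need at least one constraint")
    satisfied = sum(1 for check in constraints if check(plan))
    return max(satisfied / len(constraints), floor)


# Hypothetical trip-planning constraints, for illustration only.
CONSTRAINTS: List[Constraint] = [
    lambda p: p["cost"] <= 1000,         # stay within budget
    lambda p: p["days"] >= 3,            # minimum trip length
    lambda p: p["cities"] <= p["days"],  # at most one city per day
]
```

A plan that meets every check scores 1.0, while one that meets none scores `floor` rather than 0, so it can still be sampled occasionally as a parent.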
The deepest reframing: selection-plus-recombination-plus-diversity isn't an LLM trick at all; it's just evolution. The corpus argues diffusion models are *mathematically* evolutionary algorithms, with denoising performing exactly the selection-mutation-isolation triad and outperforming classical evolutionary methods precisely because it preserves multimodality instead of collapsing to a single answer ("Can diffusion models perform evolutionary search in parameter space?"). Read together, the notes tell a coherent story: fitness-proportional selection guides LLM recombination by doing the optimization the model can't, but it only works when something keeps the population diverse and something else gives the solutions enough structure for crossover to mean anything.
Sources (9 notes)
Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Genesys, a multi-agent LLM system using genetic programming and a Ladder of Scales verification process, discovered 1,062 novel architectures, with top designs outperforming GPT-2 and Mamba-2 on 6 of 9 benchmarks. Structured GP representation proved critical, improving design success from 14% to nearly 100% versus direct LLM generation.
Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
Denoising in diffusion models performs selection, mutation, and reproductive isolation—the core mechanisms of evolution. Diffusion Evolution empirically outperforms mainstream evolutionary algorithms by preserving multimodality where traditional methods collapse to single solutions.