How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?

This explores how an evolutionary loop — scoring candidates by a fitness signal, then having an LLM recombine the better ones — actually works when the solutions are free-form text or plans rather than neatly structured objects, and where that fitness signal is doing the real steering.

This reads the question as follows: in LLM-driven evolution, the model proposes and mutates candidates, but selection (picking which candidates survive to be recombined) supplies the optimization pressure the LLM lacks on its own. The corpus is unusually pointed on why that division of labor matters. The headline result is Mind Evolution (see "Can evolutionary search beat sampling and revision at inference time?"), which uses LLM-generated mutations and crossovers inside a genetic algorithm and solves ~98% of planning tasks, beating Best-of-N and sequential revision. The key isn't smarter mutation; it's that an island model keeps a diverse population alive, so fitness can keep sorting variants instead of collapsing to one trajectory. Fitness-proportional selection over that diverse population is what prevents the premature convergence that single-path refinement falls into.
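The two mechanisms named above, fitness-proportional selection and an island model, can be sketched in a few lines. This is a minimal illustration, not Mind Evolution's actual implementation: the function names, the ring-shaped migration pattern, and the uniform fallback for all-zero fitness are assumptions made for the example.

```python
import random

def select_parents(population, fitness, k, rng):
    """Fitness-proportional (roulette-wheel) selection, with replacement.

    Each candidate is drawn with probability fitness(c) / sum of all fitnesses.
    Assumes fitness values are non-negative.
    """
    weights = [fitness(c) for c in population]
    if sum(weights) <= 0:
        # Assumed fallback: if nobody has positive fitness, draw uniformly.
        return rng.choices(population, k=k)
    return rng.choices(population, weights=weights, k=k)

def migrate(islands, fitness, n_migrants=1):
    """Ring migration (an assumed topology): each island receives copies of
    the previous island's best candidates, which replace its own worst ones."""
    best = [sorted(isl, key=fitness, reverse=True)[:n_migrants] for isl in islands]
    result = []
    for i, isl in enumerate(islands):
        incoming = best[(i - 1) % len(islands)]
        survivors = sorted(isl, key=fitness, reverse=True)[: len(isl) - len(incoming)]
        result.append(survivors + incoming)
    return result
```

Because each island selects only among its own members between migrations, a strong candidate on one island cannot immediately crowd out the others, which is one plausible reading of how the island model sustains diversity.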

Why lean so hard on external selection? Because the corpus suggests LLMs can't run the optimization internally. Models don't actually execute iterative numerical methods: they pattern-match to memorized templates and emit plausible-but-wrong values (see "Do large language models actually perform iterative optimization?"), and they plateau around 55–60% constraint satisfaction regardless of scale or 'reasoning' training (see "Do larger language models solve constrained optimization better?"). Even RL fine-tuning sharpens template-matching rather than installing a real search procedure (see "Do fine-tuned language models actually learn optimization procedures?"). So the evolutionary framing is a workaround: the LLM supplies cheap, varied proposals; the fitness function supplies the iteration the model can't do in its own latent space. Selection is the optimizer; the LLM is the mutation operator.
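That division of labor can be made concrete with a small loop in which the iteration lives entirely outside the model. The sketch below is an assumption-laden toy: `recombine` stands in for an LLM call, `toy_recombine` is a hypothetical deterministic stub, and carrying the single best candidate forward unchanged (elitism) is a choice made here for stability, not something the source notes describe.

```python
import random

def evolve(population, fitness, recombine, generations, rng):
    """Outer optimization loop: selection does the steering, recombine proposes.

    `recombine(a, b)` is a placeholder for an LLM call that merges two parents.
    Assumes fitness values are non-negative.
    """
    for _ in range(generations):
        weights = [fitness(c) for c in population]
        if sum(weights) <= 0:
            weights = None  # assumed fallback: uniform sampling
        # Elitism (an assumption of this sketch): keep the current best as-is.
        children = [max(population, key=fitness)]
        while len(children) < len(population):
            a, b = rng.choices(population, weights=weights, k=2)
            children.append(recombine(a, b))
        population = children
    return population

def toy_recombine(a, b):
    """Hypothetical stand-in for the LLM: first half of a plus second half of b."""
    return a[: len(a) // 2] + b[len(b) // 2 :]
```

A toy run might look like `evolve(["abcd", "aaxx", "zzaa", "qqqq"], lambda s: s.count("a"), toy_recombine, 5, random.Random(0))`. The proposal function never sees a fitness value; all of the optimization pressure comes from which parents the loop hands it.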

Now the twist hiding in the phrase 'unstructured solution spaces.' The corpus pushes back on the idea that selection alone can guide recombination through a formless space. Genesys, a multi-agent LLM system doing genetic programming over neural architectures, found that imposing a structured representation lifted design success from 14% to nearly 100% versus letting the LLM generate freely (see "Can AI systems discover better neural architectures than humans?"). The same fitness signal works dramatically better when crossover operates on structured genomes rather than raw text. In other words, 'unstructured' is often the enemy, and part of what makes these systems succeed is quietly adding structure so that recombination produces valid offspring rather than noise.
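One way to see why structure helps is that crossover over a typed genome yields valid offspring by construction. The sketch below assumes a hypothetical three-field schema invented for illustration; it is not the representation Genesys uses.

```python
import random

# Hypothetical schema: each field has a fixed set of allowed values.
SCHEMA = {
    "mixer": {"attention", "ssm", "conv"},
    "norm": {"layernorm", "rmsnorm"},
    "depth": {2, 4, 8},
}

def is_valid(genome):
    """A genome is valid if it has exactly the schema's fields, each with an allowed value."""
    return genome.keys() == SCHEMA.keys() and all(genome[k] in SCHEMA[k] for k in SCHEMA)

def crossover(a, b, rng):
    """Uniform crossover: every field is inherited whole from one of the two parents."""
    return {k: rng.choice((a[k], b[k])) for k in SCHEMA}
```

If both parents pass `is_valid`, so does every child, because no field value is ever invented or cut in half. Splicing two free-text descriptions at an arbitrary character offset carries no such guarantee, so the fitness function would spend much of its budget rejecting malformed candidates.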

There's also the question of where fitness comes from at all. Selection needs a scalar to be proportional to, and the corpus says that scalar is the real bottleneck: autonomous optimization only works in domains that supply immediate scalar metrics, modular structure, and fast iteration (see "What makes a research domain suitable for autonomous optimization?"). When no clean fitness exists, you can manufacture one. AlphaLLM uses tree search outcomes and critic models to rank solution paths without human labels, letting structure substitute for an annotation oracle (see "Can tree search replace human feedback in LLM training?"), and other work has LLMs design the reward/shaping signal itself before optimizing against it (see "Can LLMs design reward functions for reinforcement learning?").

The deepest reframing: selection-plus-recombination-plus-diversity isn't an LLM trick at all; it's just evolution. The corpus argues diffusion models are *mathematically* evolutionary algorithms, with denoising performing exactly the selection-mutation-isolation triad, and that they outperform classical evolutionary methods precisely because they preserve multimodality instead of collapsing to a single answer (see "Can diffusion models perform evolutionary search in parameter space?"). Read together, the notes tell a coherent story: fitness-proportional selection guides LLM recombination by doing the optimization the model can't, but it only works when something keeps the population diverse and something else gives the solutions enough structure for crossover to mean anything.

Sources 9 notes

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can AI systems discover better neural architectures than humans?

Genesys, a multi-agent LLM system using genetic programming and a Ladder of Scales verification process, discovered 1,062 novel architectures, with top designs outperforming GPT-2 and Mamba-2 on 6 of 9 benchmarks. Structured GP representation proved critical, improving design success from 14% to nearly 100% versus direct LLM generation.

What makes a research domain suitable for autonomous optimization?

Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.

Can diffusion models perform evolutionary search in parameter space?

Denoising in diffusion models performs selection, mutation, and reproductive isolation—the core mechanisms of evolution. Diffusion Evolution empirically outperforms mainstream evolutionary algorithms by preserving multimodality where traditional methods collapse to single solutions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM systems researcher re-testing whether fitness-proportional selection remains a bottleneck or constraint in LLM-guided evolutionary search (question posed circa 2024–2025).

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable:
• LLMs plateau at 55–60% constraint satisfaction regardless of scale; they pattern-match memorized templates rather than execute iterative optimization internally (~2024).
• Fitness-proportional selection + population diversity (island models) solves ~98% of planning tasks, outperforming Best-of-N by delegating optimization to the fitness signal while LLM supplies mutations (~2024).
• Structured genome representation lifts neural-architecture design success from 14% to ~100% vs. unstructured free generation; crossover must operate on valid syntax, not raw text (~2024).
• Selection only works in domains with immediate scalar metrics, modularity, and fast iteration; absent these, LLMs can design reward shaping or use tree search + critics to manufacture fitness (~2024–2025).
• Diffusion models are mathematically evolutionary algorithms (denoising = selection + mutation + isolation) and preserve multimodality better than classical EA (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.12253 (2024-04) — Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
• arXiv:2410.02543 (2024-10) — Diffusion Models are Evolutionary Algorithms
• arXiv:2501.09891 (2025-01) — Evolving Deeper LLM Thinking
• arXiv:2603.23004 (2026-03) — Can Large Language Models Reason and Optimize Under Constraints?

Your task:
(1) RE-TEST EACH CONSTRAINT. Does recent work on RL post-training (arXiv:2505.11711, 2025-05), subnetwork fine-tuning, or language-model-based search (arXiv:2405.15194, 2024-05) show that LLMs *do* learn iterative search or constraint reasoning *inside* latent space, thereby reducing reliance on external fitness? Or does the 55–60% ceiling still hold? Separately: do recent diffusion–LLM hybrids or multi-modal genomes (beyond structured text) weaken the "structure is essential" finding, or reinforce it?
(2) Surface contradicting work: flag any 2025–2026 papers showing that unstructured recombination *without* explicit fitness ranking still achieves high performance, or that fitness-proportional selection underperforms simpler baselines (e.g., uniform sampling, age-based turnover).
(3) Propose two research questions that *assume* the regime may have shifted: (a) If LLMs can now learn to execute constrained search via RL subnetwork updates, what is the *relative role* of internal search vs. external selection—are they complementary or redundant? (b) If diffusion-as-evolution truly outperforms classical GA+LLM, does that imply fitness-proportional selection is itself the constraint, rather than the solution?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
