Reinforcement Learning for LLMs

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Note · 2026-02-22 · sourced from Training Fine Tuning
Related questions: How should we allocate compute budget at inference time? · How do you build domain expertise into general AI models? · How should researchers navigate LLM reasoning research?

Training-Free GRPO replaces the standard SFT-then-RL pipeline with iterative distillation of "experiential knowledge" — high-quality reasoning patterns extracted from rollout groups — that is prepended to prompts during API calls. Instead of GRPO's numerical group-relative advantages that update parameters, this method extracts group-relative semantic advantages — textual descriptions of what worked and why.
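For concreteness, the numerical quantity being replaced is GRPO's group-relative advantage, the group-standardized reward (Shao et al., 2024):

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

Training-Free GRPO keeps the same within-group comparison but swaps the scalar $\hat{A}_i$ for a sentence describing why the above-mean rollouts beat the below-mean ones.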

The knowledge serves as a "learned token prior": context that shifts the output distribution in the same direction that RL would shift parameters, but through the model's in-context learning capability rather than gradient updates. This achieves the same directional effect as fine-tuning (moving probability mass toward better outputs) through a completely different mechanism (conditioning on experiential context).
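A minimal sketch of that loop, assuming a generic `llm(prompt) -> str` wrapper over a black-box chat API and a `score(answer, truth) -> float` verifier; the function names and prompts are illustrative, not the paper's actual interface:

```python
import statistics

def llm(prompt: str) -> str:
    """Stub for a black-box chat-completion API call (assumed interface)."""
    raise NotImplementedError

def score(answer: str, truth: str) -> float:
    """Stub verifier, e.g. exact match against ground truth (assumed)."""
    raise NotImplementedError

def training_free_grpo(tasks, group_size=4, epochs=3):
    experience = []  # the experiential knowledge library: the "learned token prior"

    for _ in range(epochs):
        for question, truth in tasks:
            # Condition every rollout on current experience, the way GRPO
            # conditions rollouts on the current policy parameters.
            context = "Lessons from earlier attempts:\n" + "\n".join(experience)
            rollouts = [llm(f"{context}\n\n{question}") for _ in range(group_size)]
            rewards = [score(r, truth) for r in rollouts]

            # No within-group spread means no group-relative signal,
            # just as a zero-variance group yields zero advantage in GRPO.
            if statistics.pstdev(rewards) == 0:
                continue

            best = rollouts[rewards.index(max(rewards))]
            worst = rollouts[rewards.index(min(rewards))]

            # Group-relative *semantic* advantage: a textual description of
            # what separated the best rollout from the worst, distilled by
            # the model itself and appended to the library.
            experience.append(llm(
                "Compare two solutions to the same problem.\n"
                f"Better:\n{best}\n\nWorse:\n{worst}\n\n"
                "In one sentence, state the reasoning pattern that made "
                "the better solution succeed."
            ))

    return experience
```

At inference time the returned experience list is simply prepended to new prompts; nothing about the serving stack changes.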

The practical advantages are significant: no parameter updates means no overfitting to small training sets (a persistent problem with RL on limited data), no need for GPU training infrastructure, and compatibility with black-box API-only models. With just a few dozen ground-truth training samples, Training-Free GRPO outperforms fine-tuned small LLMs and improves out-of-domain performance.

This challenges the assumption that RL-like behavioral changes require parameter modification. As Can prompt optimization teach models knowledge they lack? argues, the experiential knowledge does not add new capability; it reorganizes how existing capability is expressed, which is the same function RL serves. The method is essentially automated prompt engineering guided by GRPO's selection logic but executed through in-context learning.

The connection to Can decoding-time tuning preserve knowledge better than weight fine-tuning? is direct: both achieve tuning-like effects without modifying the target model. Proxy tuning operates at the logit level; Training-Free GRPO operates at the prompt level. Both preserve the base model's knowledge intact.
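For contrast, a minimal sketch of the proxy-tuning arithmetic, with toy numpy arrays standing in for one decoding step's logits from three vocabulary-aligned models; the shift base + (expert - anti_expert) follows Liu et al. (2024), everything else here is illustrative:

```python
import numpy as np

def proxy_tuned_logits(base, expert, anti_expert):
    # Apply the tuning delta a small expert learned (relative to its
    # untuned counterpart) to the large base model's logits.
    return base + (expert - anti_expert)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 5-token vocabulary. Fine-tuning raised the expert's logit for
# token 2, so the delta moves the base model's probability mass toward
# token 2 without touching the base model's parameters.
base        = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
expert      = np.array([1.0, 0.5, 2.5, 0.0, -1.0])
anti_expert = np.array([1.0, 0.5, 0.5, 0.0, -1.0])

print(softmax(base))                                           # before
print(softmax(proxy_tuned_logits(base, expert, anti_expert)))  # after
```

Training-Free GRPO performs the analogous shift one level up, in token space rather than logit space.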


Source: Training Fine Tuning

Original note title: experiential knowledge distilled as token prior achieves rl-like distribution shifts without parameter updates