Can models dynamically activate expert skills at inference time?
Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
Transformer2 introduces Singular Value Fine-tuning (SVF): instead of modifying full weight matrices or even low-rank adaptations, SVF extracts and tunes only the singular values within a model's weight matrices. This produces compact expert vectors that are inherently composable — they can be dynamically mixed at inference without interference.
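A minimal numpy sketch of the idea, assuming a single frozen weight matrix; the dimensions and the learned vector `z` below are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # a frozen base weight matrix

# Decompose once: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# An SVF expert vector z rescales only the singular values,
# so only len(s) parameters are trained per matrix.
z = np.ones_like(s)
z[:2] = 1.5  # hypothetical learned modulation of the top two directions

W_adapted = U @ np.diag(s * z) @ Vt

# With z = 1 everywhere, the base behavior is recovered exactly.
assert np.allclose(U @ np.diag(s) @ Vt, W)
```

Because only the singular values are rescaled, an expert vector for an m×n matrix carries just min(m, n) parameters, and setting z to all ones recovers the base model exactly.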
The inference mechanism has two passes:
- First pass (dispatch): The model runs a forward pass on the input and observes its own test-time behavior, gathering information about which skills the current problem requires.
- Second pass (adaptation): The framework combines available expert vectors based on the first-pass analysis, providing a targeted modification to the base weights specifically tailored to the task.
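The two passes can be sketched as follows. The dispatch step is stubbed out with hard-coded mixing weights, and the expert vectors are illustrative placeholders rather than trained values:

```python
import numpy as np

def adapt(W, expert_vectors, alphas):
    """Second pass: compose expert vectors with dispatch weights (alphas)
    and apply the mixture as a singular-value modulation of frozen W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    z = sum(a * z_k for a, z_k in zip(alphas, expert_vectors))
    return U @ np.diag(s * z) @ Vt

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 6))
experts = [np.full(6, 1.2), np.full(6, 0.8)]  # hypothetical trained expert vectors

# In the real system the first pass (dispatch) would produce these
# mixing weights from the model's own behavior; hard-coded here.
alphas = [0.7, 0.3]
W_task = adapt(W, experts, alphas)
```

Note that the base weights are never overwritten: the task-specific matrix is recomputed from the frozen decomposition and the chosen mixture, so any mixture of experts (including none) is available at every inference call.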
Three adaptation strategies offer monotonically improving performance as access to test-time conditions increases, enabling tradeoffs appropriate to each deployment scenario.
The key properties that make this work:
- Compositionality: SVF expert vectors combine naturally because they operate on orthogonal singular value dimensions. LoRA adapters, by contrast, modify rank-k subspaces that may interfere when composed.
- Efficiency: SVF trains far fewer parameters than LoRA while outperforming it. Expert vectors are compact enough to store many specializations.
- Continual learning: New expert modules can be developed offline and added without catastrophic forgetting, because the base model weights are never modified — only the singular value modulation changes.
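The efficiency claim is easy to make concrete with a back-of-the-envelope parameter count for one square projection matrix; the dimensions and LoRA rank below are illustrative assumptions, not figures from the paper:

```python
m, n, r = 4096, 4096, 16  # illustrative layer dimensions and LoRA rank

svf_params = min(m, n)    # SVF: one scalar per singular value
lora_params = r * (m + n) # LoRA: two low-rank factors, A (m x r) and B (r x n)

ratio = lora_params // svf_params
print(svf_params, lora_params, ratio)  # 4096 131072 32
```

Under these assumptions SVF trains 32x fewer parameters per matrix than rank-16 LoRA, which is why storing many expert vectors per model stays cheap.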
The neuroscience parallel is deliberate: the brain activates specific regions depending on the task and dynamically reconfigures its functional networks in response to changing demands. Transformer2 operationalizes this for LLMs.
The deeper principle: the requisite capabilities for many downstream tasks already exist within pretrained models. The bottleneck is not knowledge but activation — knowing when to deploy which capability. This aligns with Does RL teach reasoning or just when to use it?, extending it to the architecture level: self-adaptation is about routing to existing capabilities, not creating new ones.
Source: Novel Architectures
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges.
  Relation: Transformer2 operationalizes this at the architecture level via test-time expert composition.
- How do knowledge injection methods trade off flexibility and cost?
  When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.
  Relation: SVF occupies a new position, combining lightweight training, dynamic inference, and composable expert vectors.
- Can isolating task-specific parameters prevent multi-task fine-tuning interference?
  Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.
  Relation: SVF achieves similar goals through singular value decomposition rather than region identification.
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
  Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
  Relation: complementary inference-time adaptation. Proxy-tuning applies a single expert's distributional shift, while SVF composes multiple expert vectors; both avoid modifying the base weights, but SVF offers finer-grained multi-skill composition via orthogonal singular value dimensions.
Original note title: self-adaptive LLMs compose expert vectors at inference via two-pass singular value fine-tuning