SYNTHESIS NOTE

Can editing hidden representations beat weight updates for finetuning?

Does intervening directly on a frozen model's hidden representations offer a better path to parameter-efficient adaptation than current weight-based methods? The question challenges the dominant PEFT paradigm by treating representations, rather than weights, as the semantic lever.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

Parameter-efficient finetuning (PEFT) adapts large models by updating a small number of weights (for example, LoRA and its variants). Representation Finetuning (ReFT) starts from a different premise drawn from interpretability: representations encode rich semantic information, so editing representations might be more powerful than editing weights. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. A strong instance of the family, LoReFT (low-rank linear subspace ReFT), is a drop-in replacement for existing PEFTs that is 10–50× more parameter-efficient than prior state-of-the-art PEFTs and almost always outperforms them across eight commonsense-reasoning tasks, four arithmetic-reasoning tasks, instruction following (Alpaca-Eval), and GLUE.
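To make "intervening in a low-rank linear subspace" concrete, here is a minimal sketch in plain Python. It assumes the intervention takes the form h + R^T (W h + b - R h), with R an r-by-d matrix whose rows are orthonormal, which is how LoReFT is usually written; this note itself does not spell out the formula. The function names, the toy dimensions, and the random values standing in for trained parameters are all hypothetical and chosen only for illustration.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def orthonormal_rows(r, d, rng):
    """Build r orthonormal rows of length d via Gram-Schmidt on random vectors."""
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]  # remove the component along u
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def loreft_intervene(h, R, W, b):
    """Assumed LoReFT form: h + R^T (W h + b - R h).

    Only the part of h inside the subspace spanned by R's rows is edited;
    the base model's weights are never touched.
    """
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]  # learned target, length r
    current = matvec(R, h)                                  # h's subspace coordinates
    delta = [t - c for t, c in zip(target, current)]
    # R^T delta maps the r-dimensional correction back into the d-dimensional space.
    edit = [sum(R[k][j] * delta[k] for k in range(len(R))) for j in range(len(h))]
    return [hj + ej for hj, ej in zip(h, edit)]

def num_params(r, d):
    """Parameters in one intervention: R (r*d) + W (r*d) + b (r)."""
    return 2 * r * d + r

# Toy setup: random values stand in for trained parameters (illustration only).
rng = random.Random(0)
d, r = 8, 2
h = [rng.gauss(0.0, 1.0) for _ in range(d)]
R = orthonormal_rows(r, d, rng)
W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(r)]
b = [0.0] * r
h_new = loreft_intervene(h, R, W, b)
```

Because the rows of R are orthonormal, projecting the edited vector back onto the subspace gives exactly W h + b, while everything orthogonal to the subspace passes through unchanged. Under this form a single intervention has 2rd + r trainable parameters; how that translates into the 10–50× figure depends on how many layers and token positions are intervened on, which the note does not specify.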

The idea worth keeping is the conceptual bridge: interpretability findings (that meaning lives in representations as directions or subspaces) become an adaptation method, one that intervenes in a representation subspace rather than perturbing weights. This unifies steering and finetuning: the same handle used to interpret a model can be used to adapt it.

This connects the vault's PEFT and mechanistic-interpretability threads. It operationalizes the linear-representation premise behind "Can dictionary learning scale to production language models?" (features as steerable directions) as a finetuning technique, and it rhymes with "Does reinforcement learning update only a small fraction of parameters?": in both cases adaptation concentrates in a low-dimensional subspace, whether of weights or of representations.

Related concepts in this collection 3

Can dictionary learning scale to production language models? Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
ReFT turns the interpret-via-directions premise into an adaptation method
Does reinforcement learning update only a small fraction of parameters? Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
both find adaptation lives in a low-dimensional subspace
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
representation intervention as a capability lever, here generalized to task finetuning
