Self-Improving Model Steering

Paper · arXiv 2507.08967 · Published July 11, 2025
Self Refinement · Self Consistency · Feedback · Reinforcement Learning

Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.

Preference optimization. Reinforcement learning from human feedback (RLHF) has emerged as a prominent approach for learning human preferences [31, 22]. RLHF first trains a reward model on preference data using established frameworks (e.g., the Bradley-Terry model [17]), and then applies RL algorithms (e.g., PPO [39]) to optimize LLMs against the reward model. Recent work [36, 51] shows the feasibility of bypassing explicit reward modeling and directly solving the underlying RL problem. Further, SRSO [29] unifies the losses of DPO [36] and SLiC [51], offering an improved estimate of the optimal policy. This work extends prior research on preference optimization to challenging scenarios where externally annotated data is unavailable or impractical to obtain, addressing a critical gap in current work.
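For context, the Bradley-Terry preference model and the DPO objective referenced above take the following standard forms (reproduced from the literature; the notation here is generic rather than this paper's):

$$
P(y_1 \succ y_2 \mid x) = \sigma\!\big(r(x, y_1) - r(x, y_2)\big),
\qquad
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $y_w$ and $y_l$ denote the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ controls the strength of the implicit KL regularization.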

LLM self-improvement. Recent work [3, 9, 41, 42, 5, 46, 33, 44] demonstrates the potential of enhancing LLM performance through self-improvement. By enabling models to generate, judge, and refine their own outputs, self-improvement has shown effectiveness across alignment, instruction following, and preference modeling, often matching or surpassing fine-tuning while greatly reducing human annotation effort and exposure to harmful content. A spectrum of self-improvement methods has been proposed, including synthetic preference generation [9, 22], tree-search refinement [4, 25], Nash equilibrium-based optimization [47], execution-guided verification [8], and iterative self-evolved reward modeling [16]; these approaches differ primarily in the mechanisms and granularity of self-generated feedback, ranging from internal judgment and strategic refinement to external execution validation. To the best of our knowledge, this work represents the first exploration of the self-improvement paradigm in the context of model steering.

4.1 Self-Improving Model Steering

At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling the steering function to be learned from the LLM's own behaviors without external supervision. At each iteration $t$, the current steering policy $\pi_{t-1}$ processes a mini-batch of $N$ prompts sampled from the question distribution $\mathcal{D}_q$ (without requiring ground-truth answers that would necessitate human annotation). For each prompt, the policy produces $K$ candidate responses. A preference oracle $\mathcal{O}$, which could be an existing reward model or even $\pi_{t-1}$ itself acting as its own evaluator, is queried to yield an ordering over the $K$ responses. These preference judgments define positive ($\mathcal{D}_t^+$) and negative ($\mathcal{D}_t^-$) sample buffers that pair each prompt with its preferred or disfavored outputs, respectively, creating contrastive training signals.
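To make this data-generation step concrete, the sketch below illustrates one iteration's sampling and ranking. The interfaces `policy.generate` and `oracle.rank`, and the keep-best/keep-worst buffer layout, are assumptions for illustration rather than the paper's actual API.

```python
# Minimal sketch of one SIMS data-generation step (iteration t).
# `policy.generate(prompt)` and `oracle.rank(prompt, responses)` are hypothetical interfaces.
import random

def build_contrastive_buffers(policy, oracle, prompt_pool, n_prompts=8, k_candidates=4):
    """Sample prompts, generate K candidates each, and split them into
    positive (preferred) and negative (disfavored) buffers via the oracle."""
    d_pos, d_neg = [], []                               # D_t^+ and D_t^- sample buffers
    prompts = random.sample(prompt_pool, n_prompts)     # mini-batch from D_q, no gold answers needed
    for x in prompts:
        candidates = [policy.generate(x) for _ in range(k_candidates)]
        ranked = oracle.rank(x, candidates)             # ordering over the K responses, best first
        d_pos.append((x, ranked[0]))                    # pair prompt with its preferred output
        d_neg.append((x, ranked[-1]))                   # pair prompt with its disfavored output
    return d_pos, d_neg
```

Keeping only the best and worst response per prompt is one simple way to instantiate the contrastive buffers; the oracle's full ordering could equally be used to form multiple preference pairs per prompt.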

The language model $\mathcal{M}$ is then executed on both the positive samples $(x_i, y_i^+)$ from $\mathcal{D}_t^+$ and the negative samples $(x_i, y_i^-)$ from $\mathcal{D}_t^-$. We collect layer-wise activations to construct two activation sets, $H_l^+$ and $H_l^-$, as defined in Eq. 3. We leverage an existing steering-function learner $\mathcal{A}$ (e.g., HPR [34]) to update the steering functions $\{f_l, f'_l\}_{l=1}^{L}$, which linearly or non-linearly shift model activations toward preferred behaviors while repelling undesirable ones. By composing the updated steering functions with the base model $\mathcal{M}$, we derive the refined policy $\pi_t$ for the next iteration, effectively solving the optimization problem in Eq. 5. This process is repeated to progressively refine the steering functions. Because SIMS bootstraps its training signal entirely from its own generated outputs, it decouples model steering from externally annotated data and can be extended through an arbitrary number of iterations $T$. Under mild assumptions about oracle accuracy, the policy sequence $\{\pi_t\}_{t=0}^{T}$ improves monotonically in expected preference reward. Crucially, each update operates only on sub-token activations rather than modifying full model weights, thereby maintaining computational efficiency compared to full-scale fine-tuning.
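As a rough illustration of the activation-collection and update step, the sketch below substitutes a simple difference-of-means direction for the steering-function learner $\mathcal{A}$; the actual learner (e.g., HPR [34]) and the non-linear variants $f'_l$ are not reproduced here, and `model.activations` is an assumed helper returning per-layer activation vectors.

```python
# Illustrative sketch of the steering-function update (not the paper's learner A).
# `model.activations(x, y)` is a hypothetical helper returning {layer: activation vector}.
import numpy as np

def update_steering_functions(model, d_pos, d_neg, alpha=1.0):
    """Build H_l^+ / H_l^- from positive and negative samples and return one
    linear steering function per layer: h -> h + alpha * (mean(H_l^+) - mean(H_l^-))."""
    h_pos, h_neg = {}, {}
    for (x, y_pos), (_, y_neg) in zip(d_pos, d_neg):
        for layer, act in model.activations(x, y_pos).items():
            h_pos.setdefault(layer, []).append(act)     # accumulate H_l^+
        for layer, act in model.activations(x, y_neg).items():
            h_neg.setdefault(layer, []).append(act)     # accumulate H_l^-
    steering = {}
    for layer in h_pos:
        direction = np.mean(h_pos[layer], axis=0) - np.mean(h_neg[layer], axis=0)
        # f_l shifts activations toward preferred behavior and away from disfavored behavior
        steering[layer] = lambda h, d=direction: h + alpha * d
    return steering
```

In use, each per-layer function would be applied to the corresponding layer's activations during generation, so that composing them with the base model yields the next-iteration policy $\pi_t$ without modifying any model weights.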