Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models
Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as BEST-OF-N Sampling and MAJORITY VOTING have been shown to enhance alignment and performance by trading additional computation for quality. However, existing prompt optimization approaches are inference-strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between the two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives, as well as inference budgets, substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a novel unified framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and the inference scale while accounting for the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and establish finite-budget guarantees on its error probability. Finally, we evaluate the effectiveness of PSST on six tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of inference-awareness when aligning black-box LLMs through prompt optimization.
In recent years, most state-of-the-art large language models (LLMs) have become accessible only through black-box APIs. Traditional alignment methods that require access to model weights or logits are therefore infeasible. To address this issue, prompt optimization-based alignment methods have garnered interest (Chang et al. 2024). These methods typically enhance input prompts by rewording them or appending additional instructions to better align the model's outputs with a task's objectives. Another broadly applicable alignment strategy for black-box models is scaling inference computation using strategies such as BEST-OF-N Sampling or MAJORITY VOTING. These inference scaling methods generate multiple candidate responses for the same query and select the final response via ranking or voting mechanisms (Wang et al. 2022; Krishna et al. 2022; Gui, Gârbacea, and Veitch 2024; Yue et al. 2025).
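To make these two strategies concrete, the following is a minimal sketch of BEST-OF-N Sampling and MAJORITY VOTING over a generic black-box generator; the `generate`, `reward`, and `extract_answer` callables are illustrative placeholders, not an API defined in this paper.

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              query: str, n: int) -> str:
    """Draw n candidate responses and return the one with the highest reward."""
    candidates = [generate(query) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(query, resp))

def majority_vote(generate: Callable[[str], str],
                  extract_answer: Callable[[str], str],
                  query: str, n: int) -> str:
    """Draw n responses and return the most frequent extracted answer."""
    answers = [extract_answer(generate(query)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```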
Although existing prompt optimization techniques have achieved substantial success, they are typically agnostic to how model outputs are sampled or aggregated, overlooking the impact of such inference methods. Our initial empirical investigation reveals that the performance of optimized prompts is highly sensitive to the choice of inference-scaling approach. Furthermore, our theoretical analysis shows that decoupling prompt optimization from inference can lead to misalignment. Finally, we observe that optimal alignment requires careful consideration of user-specific preferences regarding the trade-offs among multiple objectives, as well as the computational resources users are willing to expend. These findings expose a critical gap in current methods: the absence of a unified framework that simultaneously accounts for prompt optimization, inference-scaling strategies, user preferences, and computational resource constraints. To bridge this gap, we introduce IAPO (Inference-Aware Prompt Optimization), a novel prompt optimization framework designed explicitly to produce aligned responses from inference-scaled black-box LLMs. IAPO simultaneously optimizes the prompt and the inference-scaling strategy while accounting for different task objectives and computational budgets. We formulate the task of identifying an optimal policy for the IAPO framework as a contextual best-arm identification (BAI) problem. To solve it efficiently, we propose a fixed-budget training algorithm named PSST (Prompt Scaling via Sequential Trimming). Additionally, we introduce a warm-up heuristic that further improves performance within the training budget.
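As a rough illustration of fixed-budget sequential trimming over joint (prompt, inference-scale) arms, consider a sequential-halving-style loop that splits the budget across elimination rounds; this is a minimal sketch for intuition, and PSST's actual trimming schedule is specified in the paper and may differ.

```python
import math
from typing import Callable, List, Tuple

Arm = Tuple[str, int]  # a joint action: (prompt, inference scale N)

def sequential_trimming(arms: List[Arm],
                        pull: Callable[[Arm], float],
                        budget: int) -> Arm:
    """Fixed-budget elimination in the spirit of sequential halving:
    split the budget evenly across rounds, pull every surviving arm
    equally often, and trim the weaker half each round."""
    survivors = list(arms)
    rounds = max(1, math.ceil(math.log2(len(arms))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        pulls = max(1, budget // (rounds * len(survivors)))
        means = {arm: sum(pull(arm) for _ in range(pulls)) / pulls
                 for arm in survivors}
        # Keep the better-performing half of the surviving arms.
        survivors.sort(key=means.get, reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Here `pull` would wrap one inference-scaled query to the black-box model and return a stochastic alignment reward for the chosen (prompt, N) pair.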
We begin our analysis by deriving theoretical finite-budget guarantees on the error probability of PSST. Next, we empirically demonstrate the effectiveness of PSST for learning IAPO policies across six diverse tasks, including multi-objective text generation, mathematical reasoning, and commonsense reasoning benchmarks. Additionally, our analysis shows that ignoring inference scaling during prompt optimization can lead to substantial misalignment, highlighting the critical role of inference-awareness in aligning black-box LLMs.
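For intuition about the shape of such guarantees, the classical fixed-budget bound for sequential halving (Karnin, Koren, and Somekh 2013) states that, with budget $T$ over $K$ arms,

```latex
\Pr[\text{best arm not identified}]
  \;\le\; 3\log_2 K \,\exp\!\left(-\frac{T}{8\,H_2\log_2 K}\right),
\qquad
H_2 \;=\; \max_{i \ge 2}\, \frac{i}{\Delta_i^{2}},
```

where $\Delta_i$ is the reward gap between the best and the $i$-th best arm: the error probability decays exponentially in the training budget. This is the classical bound quoted for reference, not the guarantee proved for PSST, whose complexity measure over contexts and joint arms may differ.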
Comparison of Exploration Strategies (Fig. 3). PSST and the Top-K screening heuristic consistently outperform all baselines. Across all six domains, where the per-context action spaces are large (|P| × N_max ∈ [640, 1536]), UCB, softmax, and ε-greedy methods struggle to explore effectively. Among the baselines, UCB performs comparably in some domains after T = 20K, but only with extensive hyperparameter tuning. Furthermore, these baselines are fully sequential and cannot leverage the cost and computational benefits of batch exploration (illustrated in the sketch below). Full PSST attains the best final performance across four settings, while PSST+KX typically reaches strong policies faster, matching or exceeding PSST on three of the four real-data tasks when the budget is small. Under aggressive pruning (small K), however, the heuristic becomes suboptimal, most notably on summarization and on the synthetic benchmarks. This suggests that PSST+KX is attractive under tight budgets, whereas full PSST is preferable for critical tasks such as long-horizon, high-frequency deployments. Finally, statistical testing confirms that PSST, along with Top-K screening, significantly outperforms the baselines on all six datasets and under nearly all budgets. These findings indicate that our approach reliably discovers well-aligned solutions using as few as 5K to 20K inference calls in practical settings.
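The batch-exploration advantage can be seen in a minimal sketch: because an elimination round's workload is fixed upfront, all of its inference calls can be dispatched concurrently, whereas UCB-style methods choose each pull based on the outcome of the previous one. The `pull` callable is again a hypothetical thread-safe wrapper around the model API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Arm = Tuple[str, int]  # (prompt, inference scale N)

def batched_round(survivors: List[Arm],
                  pull: Callable[[Arm], float],
                  pulls_per_arm: int,
                  max_workers: int = 32) -> Dict[Arm, float]:
    """Run one elimination round as a single concurrent batch of
    inference calls and return the empirical mean reward per arm."""
    jobs = [arm for arm in survivors for _ in range(pulls_per_arm)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rewards = pool.map(pull, jobs)
    means = {arm: 0.0 for arm in survivors}
    for arm, r in zip(jobs, rewards):
        means[arm] += r / pulls_per_arm
    return means
```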
Importance of Inference-Awareness (Fig. 4). We examine the role of inference awareness in prompt optimization. Across all six datasets, IAPO methods markedly outperform the inference-agnostic methods, demonstrating the gains achievable by jointly optimizing the prompt and the inference scale. TRIPLE(N = 1) fails because it does not leverage inference scaling; TRIPLE(N = Random) fails because it does not adapt the scaling to different contexts. The screening variant PSST+K1, which effectively approximates a near-decoupled (prompt-only) procedure, fails to reach the optimum in most cases, performing competitively only on COMMONSENSEQA and underperforming most severely on summarization. The reason is that it latches onto deceptive prompts that look strong under single-shot evaluation but fail to improve with scaling, while discarding prompts that perform modestly at N = 1 yet improve substantially as the inference scale grows. These findings underscore the essential role of IAPO in aligning black-box LLMs and the pitfalls of disjoint optimization. Overall, IAPO outperforms disjoint optimization by up to 25% and prompt-only optimization by up to 50%.
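This failure mode can be reproduced in a toy simulation. The two reward distributions below are hypothetical illustrations of ours, not data from the experiments: prompt A wins under single-shot evaluation, but prompt B dominates once BEST-OF-N selection is applied.

```python
import random

def reward_a() -> float:
    """'Deceptive' prompt: consistently decent, never great."""
    return 0.6

def reward_b() -> float:
    """Scaling-friendly prompt: lower mean, but a heavy upper tail."""
    return 0.9 if random.random() < 0.3 else 0.4

def expected_best_of_n(sampler, n: int, trials: int = 20_000) -> float:
    """Monte Carlo estimate of the expected reward after BEST-OF-N selection."""
    return sum(max(sampler() for _ in range(n)) for _ in range(trials)) / trials

for n in (1, 4, 16):
    print(f"N={n:>2}  A={expected_best_of_n(reward_a, n):.3f}"
          f"  B={expected_best_of_n(reward_b, n):.3f}")
# At N=1, A wins (0.600 vs ~0.550); by N=16, B wins (~0.898 vs 0.600),
# so a prompt-only (N = 1) search would select the wrong prompt for
# deployment under scaled inference.
```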