Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Paper · arXiv 2508.09726 · Published August 13, 2025
Reinforcement Learning

Large language models trained with reinforcement learning with verifiable rewards tend to trade length for accuracy: they inflate response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely “filler”: repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering which responses to train on using one of two metrics: (1) response length and (2) token efficiency, the reward-per-token ratio. By sampling more at training time, we teach models to think less at inference time.

Longer responses can appear less accurate simply because they often arise from harder questions. To disentangle genuine length increases driven by question difficulty from unnecessary inflation, we analyze the correlation between response length and correctness across multiple responses generated by Phi-4-reasoning-plus (Abdin et al., 2025) for the same question. On AIME 25, we find that in 72% of questions with both correct and incorrect responses, the longer responses are more likely to be wrong than their shorter counterparts.
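
As a rough illustration of this per-question analysis, the sketch below (our own Python/NumPy pseudocode, not the paper's evaluation code) counts, among questions with both correct and incorrect samples, how often the incorrect responses are longer on average; the data layout and the exact statistic are assumptions.

```python
import numpy as np

def frac_questions_longer_is_wrong(samples):
    """samples: {question_id: list of (length, is_correct)} over multiple
    sampled responses per question. Among questions that have both correct
    and incorrect responses, return the fraction where the incorrect
    responses are longer on average than the correct ones."""
    hits, total = 0, 0
    for responses in samples.values():
        correct   = [length for length, ok in responses if ok]
        incorrect = [length for length, ok in responses if not ok]
        if correct and incorrect:          # only questions with both outcomes
            total += 1
            hits += np.mean(incorrect) > np.mean(correct)
    return hits / total if total else float("nan")
```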

Approaches such as Dr. GRPO (Liu et al., 2025) and DAPO's (Yu et al., 2025) token-level loss normalization have been proposed to curb the persistent length inflation in RLVR-trained models. Yet even with token-level normalization applied during the training of Phi-4-reasoning-plus, we observe rapid response length growth, from 4k to 14k tokens in just 100 steps of GRPO training. We hypothesize that while token-level normalization penalizes long incorrect responses more heavily, it also amplifies rewards for long correct chains, unintentionally reinforcing the inherent verbosity of strong base models that have been heavily SFTed for step-by-step reasoning.
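
To make the hypothesis concrete, here is a toy Python/NumPy comparison (our own illustration with made-up lengths and advantages, not either paper's exact objective) of how much each response in a group contributes to the batch gradient under per-response versus per-token averaging:

```python
import numpy as np

# Toy group of four responses to one question: token counts and
# group-relative advantages (positive ~ correct, negative ~ incorrect).
# All numbers are made up for illustration.
lengths    = np.array([400, 900, 1600, 2500])
advantages = np.array([+1.0, +1.0, -1.0, +1.0])

# Sequence-level aggregation (GRPO-style): average the per-token loss within
# each response first, then average across responses, so every response
# contributes equally regardless of length.
seq_weight = np.full(len(lengths), 1.0 / len(lengths))

# Token-level aggregation (DAPO-style): pool every token in the group and
# average once, so a response's contribution scales with its token count.
tok_weight = lengths / lengths.sum()

print("per-response weight, sequence-level:", seq_weight)
print("per-response weight, token-level:   ", np.round(tok_weight, 3))
print("weighted advantage of the longest correct chain:",
      round(seq_weight[-1] * advantages[-1], 3), "vs",
      round(tok_weight[-1] * advantages[-1], 3))
```

With these toy numbers, the longest correct chain carries roughly 46% of the group's gradient mass under per-token averaging versus 25% under per-response averaging, which is exactly the kind of asymmetry the hypothesis points to.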

Motivated by these observations, our goal is to develop efficient reasoning models—models that retain the reasoning accuracy afforded by GRPO while producing substantially shorter reasoning chains. Towards achieving this goal, we make the following contributions:

• GFPO (Group Filtered Policy Optimization): We propose GFPO (Figure 1, Section 3), a simple yet effective variant of GRPO designed to explicitly counteract response length inflation. GFPO combines rejection sampling with standard GRPO: for each question, we sample a larger group of G candidate reasoning chains to increase exposure to desirable outputs, filter them according to a target metric, and learn only from the policy gradients of the top-k retained chains. While many rejection metrics are possible, we focus on response length, retaining the shortest chains to encourage the model to “think less” while reasoning (see the sketch below).
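
A minimal sketch of this filtering step for a single question is given below (Python/NumPy; the function name, the small stabilizer on the standard deviation, and the exact within-subset normalization are our assumptions, not the paper's verbatim recipe):

```python
import numpy as np

def gfpo_select_and_advantage(rewards, lengths, k, metric="length"):
    """Sketch of GFPO's filtering step for one question: given G sampled
    responses, keep only the top-k under the chosen metric and compute
    group-normalized advantages over the retained subset. Rejected
    responses get zero advantage, i.e. contribute no policy gradient."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    if metric == "length":              # prefer the shortest chains
        order = np.argsort(lengths)
    elif metric == "token_efficiency":  # prefer the highest reward per token
        order = np.argsort(-(rewards / lengths))
    else:
        raise ValueError(metric)
    keep = order[:k]

    # Normalize rewards within the retained subset (GRPO-style z-score).
    mu, sigma = rewards[keep].mean(), rewards[keep].std() + 1e-6
    advantages = np.zeros_like(rewards)
    advantages[keep] = (rewards[keep] - mu) / sigma
    return keep, advantages

# Example: G = 8 sampled responses, keep the k = 4 shortest.
rewards = [1, 0, 1, 1, 0, 1, 0, 1]
lengths = [3200, 5100, 1800, 7400, 2600, 4400, 6800, 2100]
keep, adv = gfpo_select_and_advantage(rewards, lengths, k=4)
print("retained indices:", keep)        # the four shortest chains
print("advantages:", np.round(adv, 2))  # zeros for rejected chains
```

Switching the metric argument to "token_efficiency" corresponds to filtering by the reward-per-token ratio, the second metric named in the abstract.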