SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their respective roles in enhancing model generalization on rule-based reasoning tasks remain unclear. This paper studies the comparative effect of SFT and RL on generalization and memorization, focusing on text-based and visual reasoning tasks. We introduce GeneralPoints, an arithmetic reasoning card game, and also consider V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen textual rules and visual variants. We show that RL, especially when trained with an outcome-based reward, generalizes in both the rule-based textual and visual environments. SFT, in contrast, tends to memorize the training data and struggles to generalize out-of-distribution in either scenario. Further analysis reveals that RL improves the model’s underlying visual recognition capabilities, contributing to its enhanced generalization in visual domains. Despite RL’s superior generalization, we show that SFT remains helpful for effective RL training: SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the advantage of RL for acquiring generalizable knowledge in complex, multimodal tasks.
Although SFT and RL are both widely used for foundation model training (OpenAI, 2023b; Google, 2023; Jaech et al., 2024; DeepSeekAI et al., 2025), their distinct effects on generalization (Bousquet & Elisseeff, 2000; Zhang et al., 2021) remain unclear, making it challenging to build reliable and robust AI systems. A key challenge in analyzing the generalizability of foundation models (Bommasani et al., 2021; Brown et al., 2020) is separating data memorization from the acquisition of transferable principles. We therefore investigate the key question of whether SFT and RL primarily memorize training data (Allen-Zhu & Li, 2023a; Ye et al., 2024; Kang et al., 2024), or whether they learn generalizable rules that can adapt to novel task variants.
To address this question, we focus on two aspects of generalization: textual rule-based generalization and visual generalization. For textual rules, we study the ability of a model to apply learned rules (given as text instructions) to unseen variants of these rules (Zhu et al., 2023; Yao et al., 2024; Ye et al., 2024). For vision-language models (VLMs), visual generalization measures how consistently performance holds under variations in the visual input, such as color and spatial layout, within a given task. To study both forms of generalization, we investigate two tasks that admit rule-based and visual variants. Our first task is GeneralPoints, an original card game similar to Points24 from RL4VLM (Zhai et al., 2024a), designed to evaluate a model’s arithmetic reasoning capabilities. The model receives four cards (presented as a text description or an image) and is required to compute a target number (24 by default) using each card’s numerical value exactly once. Second, we adopt V-IRL (Yang et al., 2024a), a real-world navigation task that focuses on the model’s spatial reasoning capabilities.
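To make the task constraint concrete, the following is a minimal sketch of a GeneralPoints success check: an answer counts as correct only if it uses each card’s value exactly once and evaluates to the target. This is an illustrative reimplementation, not the paper’s environment code, and it ignores rule variants such as alternative face-card interpretations.

import ast
from collections import Counter

def is_valid_solution(expression: str, cards: list[int], target: int = 24) -> bool:
    """Return True iff `expression` uses each card value exactly once and evaluates to `target`."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    # Collect every numeric literal appearing in the proposed equation.
    numbers = [node.value for node in ast.walk(tree)
               if isinstance(node, ast.Constant) and isinstance(node.value, (int, float))]
    if Counter(numbers) != Counter(cards):
        return False  # each card's value must be used exactly once
    try:
        return abs(eval(compile(tree, "<expr>", "eval")) - target) < 1e-6
    except ZeroDivisionError:
        return False

# Example: with cards [1, 2, 3, 4], the expression "1 * 2 * 3 * 4" reaches the target 24.
assert is_valid_solution("1 * 2 * 3 * 4", [1, 2, 3, 4])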
We find that RL learns generalizable rules (expressed in text): its in-distribution performance gains also transfer to unseen rules. In contrast, SFT appears to memorize the training rules and does not generalize (see Figure 1 for an example). Beyond textual rule-based generalization, we further investigate generalization in the visual domain and observe that RL also generalizes to visual OOD tasks, whereas SFT continues to struggle.
Although RL exhibits superior generalization compared to SFT, we show that SFT is still necessary to stabilize the model’s output format, enabling RL to achieve its performance gains. Last but not least, we observe that scaling up inference-time compute by increasing the maximum number of steps leads to better generalization.
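As a concrete illustration of an outcome-based reward in this setting, the sketch below assigns reward purely from the final answer, with a separate penalty when the output format itself is violated. The specific reward values, the assumed "Answer: ..." output format, and the helper extract_expression are our assumptions for illustration, not the paper’s implementation; the sketch reuses is_valid_solution from above.

import re

def extract_expression(raw_output: str) -> str | None:
    """Pull the equation out of an 'Answer: ...' line (assumed output format)."""
    match = re.search(r"Answer:\s*(.+)", raw_output)
    return match.group(1).strip() if match else None

def outcome_reward(raw_output: str, cards: list[int], target: int = 24) -> float:
    expression = extract_expression(raw_output)
    if expression is None:
        return -1.0   # output format violated; nothing to evaluate
    if is_valid_solution(expression, cards, target):
        return 1.0    # legal equation that reaches the target
    return 0.0        # well-formatted but incorrect answer

One reading of the observation above is that, without a stable output format, most rollouts fall into the format-violation branch and provide little signal about the task itself, which is why SFT-stabilized formatting helps subsequent RL.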
Memorization and generalization in LLMs/VLMs. Several studies have examined the interplay between memorization and generalization in neural networks (Han et al., 2022; Carlini et al., 2022; Yang et al., 2023). In LLMs, memorization can manifest as the model reproducing its training data (Carlini et al., 2022; Jiang et al., 2024; Kang et al., 2024), while generalization reflects the divergence between the model’s output distribution and the pre-training data distribution (Zhang et al., 2023). Prior studies suggest that LLMs overfit more on simpler, knowledge-intensive tasks and generalize better on more complex, reasoning-intensive ones (Wang et al., 2024; Qi et al., 2024). For example, recent studies (Ye et al., 2024; Allen-Zhu, 2024; Allen-Zhu & Li, 2023a;b; 2024; Tong et al., 2024b) have demonstrated that LLMs develop reasoning skills beyond their training data by pre-computing reasoning graphs before autoregressive generation, providing compelling evidence of generalization. Our study takes a different approach, investigating how different post-training paradigms affect memorization versus generalization in the context of textual rule-based and visual variants.
Scaling up inference-time compute. Recent research has increasingly focused on scaling up inference-time computation to improve model performance (Wei et al., 2022b; Yao et al., 2024; Snell et al., 2024; Jaech et al., 2024). Early studies (Wei et al., 2022b; Yao et al., 2024) prompted models to generate intermediate reasoning steps and extend their responses before producing a final answer. Subsequent work (Zelikman et al., 2022; Feng et al., 2023; Tian et al., 2024; Chen et al., 2024a; Snell et al., 2024) has demonstrated that using fine-tuned verifiers at inference time improves model accuracy, effectively utilizing test-time computation. Notably, recent findings (Jaech et al., 2024; DeepSeekAI et al., 2025) reveal “scaling laws” for inference-time compute, highlighting significant performance gains with increased computational resources. Our work builds upon these findings in two ways. First, we integrate insights from inference-time verification into a multi-turn RL formulation that allows the model to identify and correct its errors. Second, we examine the impact of inference-time verification on RL generalization, demonstrating that scaling up inference-time verification (in terms of the maximum number of verification steps) is key for RL to generalize.
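Below is a minimal sketch of the multi-turn verify-and-revise loop described above, with the inference-time budget controlled by the maximum number of verification steps. Here model.generate and verifier are hypothetical placeholders standing in for the policy and the environment’s checker, not APIs from the paper.

def solve_with_verification(model, verifier, prompt: str, max_steps: int = 5) -> str:
    """Generate an answer, then let the verifier prompt up to max_steps - 1 revisions."""
    history = prompt
    answer = model.generate(history)       # hypothetical generation call
    for _ in range(max_steps - 1):
        feedback = verifier(answer)        # e.g., "card 7 used twice", or None if no error found
        if feedback is None:
            break
        # Append the failed attempt and the feedback, then ask for a revision.
        history += f"\n[Previous answer] {answer}\n[Verifier feedback] {feedback}\n"
        answer = model.generate(history)
    return answer

Increasing max_steps is the knob referred to above: each additional verification step gives the model another opportunity to detect and correct its own errors at inference time.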