Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals such as pedagogical soundness remains a significant challenge. Standard reinforcement learning pipelines often rely on slow and expensive LLM-as-a-judge evaluations or on brittle lexical-overlap metrics such as ROUGE [14], which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not only factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results show that GRPO with the proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, demonstrating that lightweight encoder models can provide nuanced reward shaping in complex generation tasks.
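Concretely, the semantic reward amounts to a single forward pass through an off-the-shelf sentence encoder. The sketch below illustrates the idea, assuming the sentence-transformers library; the multilingual checkpoint name is an illustrative placeholder, not the exact encoder or pre-processing used in our experiments.

```python
# Minimal sketch of the semantic reward: an encoder-only model embeds the
# generated explanation and the reference, and their cosine similarity is
# used as a dense reward. The checkpoint name is illustrative.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_reward(generated: str, reference: str) -> float:
    """Cosine similarity between explanation embeddings, roughly in [-1, 1]."""
    emb = encoder.encode(
        [generated, reference], convert_to_tensor=True, normalize_embeddings=True
    )
    # With unit-normalized embeddings the dot product equals cosine similarity.
    return float(util.cos_sim(emb[0], emb[1]).item())
```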
This work puts forward the following contributions.
• A novel reward-shaping framework for GRPO that uses an efficient encoder-only model to provide a dense, semantic similarity-based reward for explanation quality.
• A demonstration that this semantic reward, combined with auxiliary rewards for correctness and formatting (sketched in code after this list), significantly improves explanation faithfulness and clarity over strong SFT baselines.
• An empirical analysis showing that the resulting model outperforms its SFT counterpart not only on the target domain but also on out-of-domain reasoning tasks, as evaluated by an external LLM judge.
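The sketch below shows one way the composite reward and GRPO's group-relative advantages could be wired together. The reward weights, the `<answer>` tag convention, and the option letters are hypothetical placeholders rather than the exact configuration used in our experiments; `semantic_reward` refers to the encoder-based scorer sketched above.

```python
# Hypothetical composite reward: dense semantic similarity plus auxiliary
# correctness and formatting terms. Weights and the <answer> tag convention
# are illustrative only.
import re
import numpy as np

ANSWER_RE = re.compile(r"<answer>\s*([A-E])\s*</answer>")  # assumed output format

def total_reward(generated: str, reference: str, gold_option: str) -> float:
    r_sem = semantic_reward(generated, reference)  # encoder-based scorer sketched above
    match = ANSWER_RE.search(generated)
    r_format = 1.0 if match else 0.0                        # did the model follow the format?
    r_correct = 1.0 if match and match.group(1) == gold_option else 0.0
    return 1.0 * r_sem + 1.0 * r_correct + 0.5 * r_format   # illustrative weights

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO advantage: standardize each completion's reward within its sampled group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In this view, GRPO samples a group of completions for each prompt, scores each with the composite reward, and uses the within-group standardized scores as advantages in the policy-gradient update.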
Self-reward and pedagogical alignment. Chen et al. show that LLMs can bootstrap hidden reasoning skills via a self-rewarding variational formulation (LaTRO) without external feedback [2]. Complementarily, Sonkar et al. collect human preferences for pedagogical alignment, tuning models that guide learners through sub-questions rather than revealing answers directly [27]. Our work shares this tutoring objective while focusing on high-stakes medical-school admission content.
Compared with the above literature, our primary innovation is the use of a lightweight encoder-only transformer as a semantic similarity scorer within the GRPO framework, providing a practical and effective alternative to both LLM-as-a-judge evaluation and naive lexical-overlap rewards for aligning explanations.