Reranking-based Generation for Unbiased Perspective Summarization
Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model–based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance.
These issues are especially problematic in opinionated article summarization (Amplayo et al., 2021; Iso et al., 2022), where unbiased representation of diverse viewpoints is crucial.
However, two gaps remain unaddressed in this setting: (1) existing evaluation metrics are primarily derived from news summarization domains and have not been validated for measuring perspective summary quality, and (2) the effectiveness of LLM-based methods beyond zero-shot inference in generating unbiased, high-quality perspective summaries remains underexplored.
To address these gaps, we first identify effective metrics for measuring summary quality by constructing a test set to evaluate existing metrics. We focus on two key attributes that a desirable summary should have: perspective coverage—the extent to which the summary includes all key content from the intended perspective, and perspective faithfulness—the degree to which the summary excludes content unsupported by the source articles of the target perspective. We collect key point annotations from articles to create controlled summaries with varied key point selections and assigned ground truth scores. We find that language model-based metrics such as ALIGNSCORE (Zha et al., 2023) and prompting-based scoring (Zheng et al., 2023) serve as strong evaluators, whereas traditional metrics (ROUGE (Lin, 2004), BERTSCORE (Zhang et al., 2020)) underperform.
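The metric-benchmarking idea above can be illustrated with a minimal sketch: construct controlled summaries whose ground-truth coverage is known (the fraction of key points they include), score them with a candidate metric, and check how well the metric's ranking agrees with the ground truth via Spearman correlation. The key points, the toy unigram-recall metric, and all function names here are illustrative assumptions, not the paper's actual data or pipeline.

```python
def _ranks(xs):
    # Average ranks, handling ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the rank vectors.
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) *
           sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

def unigram_recall(summary, reference):
    # Toy stand-in for a coverage metric (roughly ROUGE-1 recall).
    s, r = set(summary.lower().split()), set(reference.lower().split())
    return len(s & r) / len(r)

# Hypothetical key points for one perspective; a controlled summary
# including the first k of them has ground-truth coverage k / 3.
key_points = [
    "taxes should fund public transit",
    "transit cuts hurt rural commuters",
    "fare hikes burden low income riders",
]
reference = " ".join(key_points)
controlled = [(" ".join(key_points[:k]), k / 3) for k in (1, 2, 3)]

metric_scores = [unigram_recall(s, reference) for s, _ in controlled]
truth_scores = [t for _, t in controlled]
reliability = spearman(metric_scores, truth_scores)
```

A reliable metric should rank the controlled summaries in the same order as their ground-truth scores (correlation near 1.0); the paper's finding is that LLM-based metrics achieve this far more consistently than ROUGE or BERTScore.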
Notably, preference tuning with Direct Preference Optimization (DPO) (Rafailov et al., 2023) on reranked generations further boosts performance on both attributes, with particularly strong gains in faithfulness. Our results suggest that current LLMs can generate high-quality perspective summaries with strong coverage and faithfulness, and that preference-based training can improve them further. In summary, our contributions are as follows:
• We construct a controlled test set and identify effective metrics for measuring coverage and faithfulness for perspective summarization.
• We evaluate various generation methods and demonstrate that reranking-based approaches deliver the best performance in producing summaries with improved perspective coverage and faithfulness. Notably, preference tuning on reranked generations significantly improves both attributes, with the most pronounced gains in faithfulness.
• We conduct ablation studies and show that commonly employed prompting frameworks consistently underperform relative to reranking-based methods, even when scaled to high-resource settings.
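The reranking approach referenced throughout can be sketched as best-of-N selection: sample several candidate summaries and keep the one a quality metric scores highest. The `generate` and `score` callables below are toy stand-ins for an LLM sampler and a coverage/faithfulness metric, not the paper's actual components.

```python
import random

def rerank(passages, generate, score, n=8, seed=0):
    # Best-of-N reranking: sample n candidate summaries and return
    # the one the quality metric scores highest.
    rng = random.Random(seed)
    candidates = [generate(passages, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(c, passages))

def toy_generate(passages, rng):
    # Stand-in for LLM sampling: a random subset of source sentences.
    k = rng.randint(1, len(passages))
    return " ".join(rng.sample(passages, k))

def toy_score(summary, passages):
    # Stand-in coverage metric: fraction of source vocabulary retained.
    src = set(" ".join(passages).split())
    return len(set(summary.split()) & src) / len(src)

passages = ["apples are red", "bananas are yellow", "grapes are green"]
best = rerank(passages, toy_generate, toy_score, n=8)
```

In practice the scorer would be one of the validated metrics identified above (e.g. an AlignScore- or LLM-judge-style evaluator), and the sampler an instruction-tuned LLM with temperature sampling.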
Beyond developing improved faithfulness metrics, prior work focuses on improving the factual consistency of summarizers, with studies noting the tradeoff between abstractiveness and faithfulness (Durmus et al., 2020; Dreyer et al., 2023). Accordingly, some methods improve faithfulness without increasing extraction (Ladhak et al., 2022), while others modify training via contrastive (Nan et al., 2021), multi-task (Chen et al., 2022), or reinforcement learning (Roit et al., 2023) methods. In contrast, we show that reranking-based methods serve as a strong baseline that yields high faithfulness without sacrificing abstractiveness, and a DPO-based approach trained on reranked self-generated summaries further improves both qualities.
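One way to realize DPO training on reranked self-generated summaries is to label each input's sampled candidates with the reranking metric and pair the top-scored candidate (chosen) against the bottom-scored one (rejected). The sketch below uses hypothetical `generate`/`score` callables and deterministic stand-ins for illustration; it is not the authors' exact pipeline.

```python
import itertools

def build_preference_pairs(prompts, generate, score, n=8):
    # For each input, sample n candidate summaries, rank them with the
    # metric, and keep (top, bottom) as a DPO (chosen, rejected) pair.
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        ranked = sorted(candidates, key=lambda c: score(c, prompt))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],
                      "rejected": ranked[0]})
    return pairs

# Deterministic stand-ins: candidate IDs count upward, and the score
# simply prefers later candidates, so the ranking is predictable.
_ids = itertools.count()
toy_generate = lambda prompt: f"summary-{next(_ids)}"
toy_score = lambda cand, prompt: int(cand.split("-")[1])

pairs = build_preference_pairs(["source passages"], toy_generate,
                               toy_score, n=4)
```

The resulting `{"prompt", "chosen", "rejected"}` records match the triplet format commonly expected by DPO training libraries.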
Perspective-Conditioned Summarization. Existing research on opinion summarization and related tasks has primarily focused on domains such as product reviews (Bražinskas et al., 2020), while recent work has broadened to a range of tasks on opinionated texts. Most single-document methods aim to preserve authorial intent (Liu et al., 2024b) or polarity (Lei et al., 2024), whereas multi-document summarization must integrate varied perspectives. For instance, Lee et al. (2022b) generate politically neutral summaries from sets of left-, right-, and center-leaning news articles. Other approaches aim to fairly represent diverse perspectives in reviews (Zhang et al., 2024c), controllably represent community perspectives (Feng et al., 2024), generate consensus summaries (Bakker et al., 2022), or produce multiple summaries reflecting distinct political perspectives (Deas and McKeown, 2025). In line with these works, we summarize a target political perspective from a set of input passages while addressing the coverage and faithfulness issues that these studies observe in existing models.