FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Evaluating the alignment of LLMs to human values is challenging for two reasons. First, open-ended user instructions usually require a composition of multiple abilities, which makes measurement with a single metric insufficient. Second, since these instructions are task-agnostic, the required abilities often vary from one instance to another, making it impractical to use a fixed set of metrics.
Benchmarks that adopt multiple metrics are not scalable, since each metric targets different skills, domains, and difficulties, e.g., GSM8K (Cobbe et al., 2021) for logical correctness and TruthfulQA (Lin et al., 2022) for truthfulness. Relying on such automatic metrics also limits interpretability and reliability, because only task-wise analysis is possible and automatic metrics are sensitive to surface forms (Krishna et al., 2021). Moreover, merely assigning a single score based on preferences does not tell the whole story, because there can be multiple axes along which to evaluate a response, such as completeness and factuality. Instead, we need to evaluate a model's performance using fine-grained criteria to understand the model from various perspectives.
We propose FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a novel evaluation protocol that adopts a fine-grained scoring setup, enabling task-agnostic skill evaluation aligned with the provided instructions. We define 4 primary abilities, divided into 12 fine-grained skills, for comprehensive language model evaluation: Logical Thinking (Logical Correctness, Logical Robustness, Logical Efficiency), Background Knowledge (Factuality, Commonsense Understanding), Problem Handling (Comprehension, Insightfulness, Completeness, Metacognition), and User Alignment (Conciseness, Readability, Harmlessness). First, we collect a total of 1,740 evaluation instances from various NLP datasets and annotate the relevant set of skills (a skill set), the domains, and the difficulty level for each instance. Then, evaluators assign scores ranging from 1 to 5 for each annotated skill based on the reference answer and skill-specific scoring rubrics, where the evaluators can be either human evaluators or state-of-the-art LLMs.
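The instance-wise scoring described above can be sketched as a small aggregation routine: each instance carries 1-5 scores for its annotated skills, and per-skill averages are computed across instances. This is a minimal illustration; the record layout, field names, and example instructions are our own, not from the paper.

```python
from statistics import mean
from collections import defaultdict

# One FLASK-style evaluation record per instance: the instruction plus an
# evaluator's 1-5 score for each skill annotated as relevant to it.
# All field names and example instructions here are illustrative.
evaluations = [
    {"instruction": "Prove that sqrt(2) is irrational.",
     "scores": {"Logical Correctness": 4, "Logical Robustness": 3, "Completeness": 4}},
    {"instruction": "List three causes of the French Revolution.",
     "scores": {"Factuality": 5, "Completeness": 4, "Conciseness": 5}},
]

def skill_averages(evals):
    """Aggregate instance-wise skill scores into a per-skill mean."""
    per_skill = defaultdict(list)
    for record in evals:
        for skill, score in record["scores"].items():
            per_skill[skill].append(score)
    return {skill: mean(vals) for skill, vals in per_skill.items()}

print(skill_averages(evaluations))
```

Because skill sets differ per instance, each skill's average is taken only over the instances where that skill was annotated, rather than over the full evaluation set.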
• We observe that current open-source LLMs significantly underperform proprietary LLMs on the Logical Thinking and Background Knowledge abilities.
• We observe that some skills, such as Logical Correctness and Logical Efficiency, require larger model sizes to acquire effectively than other skills do.
• We show that even state-of-the-art proprietary LLMs struggle on the FLASK-HARD set, with up to 50% performance degradation for some skills compared to the full FLASK evaluation set.
• We suggest that comprehensive analysis of LLMs through fine-grained evaluation is both important and practical for developers and practitioners alike.
To assess multiple aspects of the model response, multi-metric evaluation settings have been proposed, providing a more comprehensive perspective of model performance beyond accuracy (Liang et al., 2022; Thoppilan et al., 2022; Fu et al., 2023; Jain et al., 2023; Lee et al., 2022). Furthermore, to faithfully evaluate LLMs on tasks such as fact verification or summarization, recent works have proposed fine-grained atomic evaluation settings (Min et al., 2023; Krishna et al., 2023). In particular, Wu et al. (2023a) and Lightman et al. (2023) show that fine-grained evaluation of model responses can be utilized for better rewards. In FLASK, we adopt an instance-wise fine-grained multi-metric setting.
Our proposed categorization includes four primary abilities, each of which is further divided into 2-4 skills, resulting in a total of 12 skills:
• Logical Thinking refers to the ability to apply reasoning, critical thinking, and deductive skills when processing and responding to instructions. In order to do so, models should generate a logically correct final answer (LOGICAL CORRECTNESS) while preserving generalizability during the step-by-step logical process without any contradiction (LOGICAL ROBUSTNESS). Also, the logical process should be efficient and not contain any unnecessary steps (LOGICAL EFFICIENCY).
• Background Knowledge comprises the capacity to generate responses by accessing a broad repository of general and domain-specific information. This ability requires the model to provide accurate and contextually relevant responses to instructions requiring factual (FACTUALITY) or commonsense knowledge (COMMONSENSE UNDERSTANDING).
• Problem Handling pertains to the proficiency in addressing challenges that emerge while processing and responding to user instructions. This category encompasses the capacity to understand the implicit and explicit purpose and requirements of the instruction (COMPREHENSION), develop creative perspectives or interpretations of the instruction (INSIGHTFULNESS), handle the instruction by providing in-depth and in-breadth information (COMPLETENESS), and be aware of its own capability to answer the instruction (METACOGNITION).
• User Alignment represents the ability to empathize with the user and align its responses to the user’s intentions, preferences, and expectations. This category encompasses the model’s ability to structure the answer to promote the user’s readability (READABILITY), to present a concise response without unnecessary information (CONCISENESS), and to account for potential risks to user safety (HARMLESSNESS).
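The taxonomy above can be encoded as a simple lookup table mapping each primary ability to its constituent skills, which is convenient for validating annotations and grouping per-skill scores into ability-level results. The skill and ability names are taken directly from the taxonomy; the table layout and helper function are our own sketch.

```python
# The four primary abilities and their 12 constituent skills, as named
# in the FLASK taxonomy. The data structure itself is illustrative.
SKILL_TAXONOMY = {
    "Logical Thinking": ["Logical Correctness", "Logical Robustness", "Logical Efficiency"],
    "Background Knowledge": ["Factuality", "Commonsense Understanding"],
    "Problem Handling": ["Comprehension", "Insightfulness", "Completeness", "Metacognition"],
    "User Alignment": ["Readability", "Conciseness", "Harmlessness"],
}

ALL_SKILLS = [s for skills in SKILL_TAXONOMY.values() for s in skills]
assert len(ALL_SKILLS) == 12  # 4 primary abilities, 12 fine-grained skills

def ability_of(skill):
    """Map a fine-grained skill back to its primary ability."""
    for ability, skills in SKILL_TAXONOMY.items():
        if skill in skills:
            return ability
    raise KeyError(skill)
```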
We annotate metadata consisting of 1) the essential skills required to follow the instruction, 2) the target domains, and 3) the difficulty level of the instruction. We first validate that human labelers and the EVAL LM show high correlation on the metadata annotation for a subset of 200 instances.
The EVAL LM selects the top-3 essential skills required to follow the instruction of each instance, from the 12 skills defined in Section 3.1. We achieve this by providing the EVAL LM with the instruction, the reference answer, and descriptions of all 12 skills. For domain annotation, we identify 10 domains by modifying the Wikipedia categorization of Reid et al. (2022): Humanities, Language, Culture, Health, History, Natural Science, Math, Social Science, Technology, and Coding. Lastly, for difficulty annotation, we divide difficulty into 5 levels based on the extent of domain knowledge required, referencing Webb’s depth of knowledge (Webb, 1997; 1999) and the NIH proficiency scale: simple lifestyle knowledge, advanced lifestyle knowledge, formal education knowledge, major-level knowledge, and expert-level knowledge, which we map to levels 1 through 5 respectively.
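The metadata annotation step can be sketched as assembling a single prompt that gives the EVAL LM the instruction, the reference answer, and all 12 skill descriptions, and asks for the top-3 skills, a domain, and a difficulty level. The domain list and difficulty-to-level mapping follow the scheme above; the prompt wording and function name are a guess, not the paper's actual template.

```python
# Difficulty levels from the annotation scheme, mapped to 1-5.
DIFFICULTY_LEVELS = {
    "simple lifestyle knowledge": 1,
    "advanced lifestyle knowledge": 2,
    "formal education knowledge": 3,
    "major-level knowledge": 4,
    "expert-level knowledge": 5,
}

# The 10 domains used for domain annotation.
DOMAINS = ["Humanities", "Language", "Culture", "Health", "History",
           "Natural Science", "Math", "Social Science", "Technology", "Coding"]

def build_metadata_prompt(instruction, reference_answer, skill_descriptions):
    """Assemble a metadata-annotation prompt for the EVAL LM.
    The exact wording here is hypothetical, not the paper's template."""
    skills_text = "\n".join(f"- {name}: {desc}"
                            for name, desc in skill_descriptions.items())
    return (
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference_answer}\n\n"
        f"Skill definitions:\n{skills_text}\n\n"
        f"Select the 3 most essential skills, one domain from {DOMAINS}, "
        f"and a difficulty level from {list(DIFFICULTY_LEVELS)}."
    )
```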
By comparing GPT-3.5 and the other two open-source models (VICUNA and WIZARDLM), we observe that the Problem Handling and User Alignment abilities can be almost fully imitated, including Metacognition, Readability, and Conciseness. However, a large gap is especially noticeable in the Logical Thinking and Background Knowledge abilities. This result aligns with Gudibande et al. (2023), which demonstrates that open-source models only imitate the style of proprietary models rather than their factuality.
Skills such as Readability, Harmlessness, and Metacognition show slow improvement as the model scales up.
Skills such as Logical Robustness, Logical Correctness, and Logical Efficiency show rapid improvements. Using FLASK, we confirm the finding of Gudibande et al. (2023) that skills requiring logical reasoning or fact retrieval benefit significantly from model scaling. Interestingly, we observe that for some skills, performance nearly saturates beyond a particular scale: Logical Efficiency and Conciseness after 30B, Insightfulness after 13B, and Metacognition after 7B.