Do all AI skills improve equally as models scale?
Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.
FLASK (Fine-grained Language model evaluation based on Alignment SKill sets) decomposes LLM capability into 12 skills across 4 primary abilities, revealing that "model quality" is not a single dimension but a portfolio of capabilities with distinct scaling behaviors (a data-structure sketch follows the list below):
Logical Thinking (3 skills: Correctness, Robustness, Efficiency) — improves rapidly with model scale. These are the skills that differentiate larger models most clearly. Logical Correctness keeps improving through 70B; Logical Efficiency climbs steeply before leveling off around 30B (see the saturation points below).
Background Knowledge (2 skills: Factuality, Commonsense Understanding) — also benefits significantly from scaling, but relies on pretraining data coverage rather than emergent capability.
Problem Handling (4 skills: Comprehension, Insightfulness, Completeness, Metacognition) — mixed scaling. Insightfulness saturates at ~13B. Metacognition saturates at ~7B.
User Alignment (3 skills: Readability, Conciseness, Harmlessness) — relatively flat scaling. These skills can be "almost fully imitated" by open-source models trained on proprietary model outputs.
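To make the taxonomy concrete, here is a minimal Python sketch of the 12-skill decomposition annotated with the scaling behavior described above. The dictionary shape and the `saturation_b` field are illustrative conveniences, not the paper's format; `None` marks skills this note describes as still improving at the largest scales discussed.

```python
# Sketch of FLASK's 12 skills across 4 abilities, annotated with the
# scaling behavior described in this note. saturation_b = approximate
# model size (billions of parameters) where gains flatten; None = still
# improving through the largest scale discussed (70B), or no saturation
# point given in the note.
FLASK_SKILLS = {
    "Logical Thinking": {
        "Logical Correctness": {"scaling": "steep", "saturation_b": None},
        "Logical Robustness":  {"scaling": "steep", "saturation_b": None},
        "Logical Efficiency":  {"scaling": "steep", "saturation_b": 30},
    },
    "Background Knowledge": {
        "Factuality":                {"scaling": "steep", "saturation_b": None},
        "Commonsense Understanding": {"scaling": "steep", "saturation_b": None},
    },
    "Problem Handling": {
        "Comprehension":  {"scaling": "mixed", "saturation_b": None},
        "Insightfulness": {"scaling": "mixed", "saturation_b": 13},
        "Completeness":   {"scaling": "mixed", "saturation_b": None},
        "Metacognition":  {"scaling": "mixed", "saturation_b": 7},
    },
    "User Alignment": {  # the "almost fully imitable" style skills
        "Readability":  {"scaling": "flat", "saturation_b": None},
        "Conciseness":  {"scaling": "flat", "saturation_b": None},
        "Harmlessness": {"scaling": "flat", "saturation_b": None},
    },
}
```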
The style-vs-substance gap is the most practically important finding: open-source models distilled from proprietary models copy Problem Handling and User Alignment (the style dimensions) but fail at Logical Thinking and Background Knowledge (the substance dimensions). This confirms, at a more granular level, the finding in "Does supervised fine-tuning actually improve reasoning quality?": imitation learning acquires superficial capabilities while missing the underlying reasoning.
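A hedged sketch of how one might quantify that gap, assuming you have per-skill FLASK scores for a proprietary teacher and a distilled student (the grouping follows the taxonomy above; the function name and score dictionaries are hypothetical):

```python
# Style skills (Problem Handling + User Alignment) vs. substance skills
# (Logical Thinking + Background Knowledge), per this note's framing.
STYLE_SKILLS = {
    "Comprehension", "Insightfulness", "Completeness", "Metacognition",
    "Readability", "Conciseness", "Harmlessness",
}
SUBSTANCE_SKILLS = {
    "Logical Correctness", "Logical Robustness", "Logical Efficiency",
    "Factuality", "Commonsense Understanding",
}

def imitation_gap(teacher: dict[str, float], student: dict[str, float]) -> dict[str, float]:
    """Mean teacher-minus-student score, split by style vs. substance.

    A near-zero style gap next to a large substance gap is the
    signature of imitation learning described above.
    """
    def mean_gap(skills: set[str]) -> float:
        return sum(teacher[s] - student[s] for s in skills) / len(skills)

    return {
        "style_gap": mean_gap(STYLE_SKILLS),
        "substance_gap": mean_gap(SUBSTANCE_SKILLS),
    }
```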
The saturation points have deployment implications: there is no point scaling past 7B for Metacognition or past 30B for Logical Efficiency; the returns are flat. But for Logical Correctness and Factuality, scaling continues to help. The optimal model size therefore depends on which skills the application requires (a sizing sketch follows): a 13B model is enough for creative tasks that lean on Insightfulness, which saturates at that size, but not for mathematical reasoning, where Logical Correctness keeps improving.
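A sizing sketch under the assumptions above: given the skills an application needs and the approximate saturation points from this note, pick the smallest model size that still buys improvement on every required skill. The candidate sizes and the helper are illustrative, not from the paper.

```python
CANDIDATE_SIZES_B = [7, 13, 30, 70]  # illustrative deployment options

# Approximate saturation points from this note; skills absent from the
# map (e.g. Logical Correctness, Factuality) keep improving with scale,
# so they default to the largest candidate size.
SATURATION_B = {"Metacognition": 7, "Insightfulness": 13, "Logical Efficiency": 30}

def minimal_size_for(required_skills: list[str]) -> int:
    """Smallest candidate size past which no required skill still improves."""
    needed = max(SATURATION_B.get(s, CANDIDATE_SIZES_B[-1]) for s in required_skills)
    return min(size for size in CANDIDATE_SIZES_B if size >= needed)

# A creative assistant leaning on Insightfulness vs. a math tutor:
assert minimal_size_for(["Insightfulness", "Metacognition"]) == 13
assert minimal_size_for(["Logical Correctness", "Factuality"]) == 70
```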
The FLASK-HARD subset reveals another pattern: even state-of-the-art proprietary models show up to 50% performance degradation on hard instances for some skills. The difficulty dimension interacts with skill type — some skills degrade gracefully with difficulty (Readability), others collapse (Logical Robustness).
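To see the skill-by-difficulty interaction in score form, a small sketch assuming per-skill scores on the standard split and on FLASK-HARD (the dictionaries and example numbers are hypothetical):

```python
def hardness_degradation(standard: dict[str, float], hard: dict[str, float]) -> dict[str, float]:
    """Relative per-skill score drop from the standard split to FLASK-HARD.

    Gracefully degrading skills (e.g. Readability) show small drops;
    collapsing skills (e.g. Logical Robustness) can approach the ~50%
    degradation the note mentions.
    """
    return {skill: (standard[skill] - hard[skill]) / standard[skill] for skill in standard}

# e.g. hardness_degradation({"Readability": 4.6, "Logical Robustness": 4.0},
#                           {"Readability": 4.3, "Logical Robustness": 2.1})
# -> {"Readability": ~0.07, "Logical Robustness": ~0.48}
```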
Source: FLASK (arXiv:2307.10928)
Related concepts in this collection
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  FLASK quantifies this at the skill level: style dimensions are imitable, reasoning dimensions are not.
- Does medical AI need knowledge or reasoning more?
  Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
  FLASK's skill decomposition provides the measurement framework for this claim.
- Are language models developing real functional competence or just formal competence?
  Neuroscience suggests formal linguistic competence (rules and patterns) and functional competence (real-world understanding) rely on different brain mechanisms. Can next-token prediction alone produce both, or does it leave functional competence behind?
  FLASK's skills map roughly onto this distinction: User Alignment ≈ formal competence (acquirable by imitation), Logical Thinking ≈ functional competence (requires genuine capability).
- Can LLM judges be tricked without accessing their internals?
  Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.
  FLASK's differential scaling explains why LLM judges have systematic biases: User Alignment skills (readability, formatting) saturate early and are over-developed relative to Logical Thinking skills, so judges over-weight the presentation features they can assess well while under-weighting the reasoning quality they assess poorly.
- Can reward models benefit from reasoning before scoring?
  Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
  Reward reasoning models (RRMs) target exactly the evaluation skills that FLASK shows scale with compute: by adding reasoning traces before scoring, they invest extra compute in the Logical Thinking dimension that benefits most from scale, rather than the User Alignment dimension that saturates early.
Original note title: evaluation skills scale differently with model size — logical reasoning improves rapidly while metacognition and readability saturate early and style imitation masks capability gaps