
Do all AI skills improve equally as models scale?

Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) decomposes LLM capability into 12 skills grouped under 4 primary abilities, revealing that "model quality" is not a single dimension but a portfolio of capabilities with distinct scaling behaviors (a small data-structure sketch of the taxonomy follows the list):

Logical Thinking (3 skills: Correctness, Robustness, Efficiency) — improves rapidly with model scale. These are the skills that differentiate larger models most clearly. Logical Efficiency and Correctness show steep improvement curves through 70B.

Background Knowledge (2 skills: Factuality, Commonsense Understanding) — also benefits significantly from scaling, but relies on pretraining data coverage rather than emergent capability.

Problem Handling (4 skills: Comprehension, Insightfulness, Completeness, Metacognition) — mixed scaling. Insightfulness saturates at ~13B. Metacognition saturates at ~7B.

User Alignment (3 skills: Readability, Conciseness, Harmlessness) — relatively flat scaling. These skills can be "almost fully imitated" by open-source models trained on proprietary model outputs.
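Because the taxonomy is just a fixed grouping of 12 skills under 4 abilities, it is easy to carry around as structured data. A minimal Python sketch, assuming per-skill scores on a shared rubric scale; the grouping follows the list above, but the plain averaging is an illustrative aggregation, not the paper's evaluation protocol:

```python
# FLASK's 4 primary abilities and their 12 component skills,
# as grouped in the note above.
FLASK_SKILLS = {
    "Logical Thinking": ["Logical Correctness", "Logical Robustness", "Logical Efficiency"],
    "Background Knowledge": ["Factuality", "Commonsense Understanding"],
    "Problem Handling": ["Comprehension", "Insightfulness", "Completeness", "Metacognition"],
    "User Alignment": ["Readability", "Conciseness", "Harmlessness"],
}

def ability_scores(per_skill: dict[str, float]) -> dict[str, float]:
    """Collapse per-skill scores into the 4 primary abilities by plain
    averaging, skipping any skill that was not annotated."""
    out = {}
    for ability, skills in FLASK_SKILLS.items():
        rated = [per_skill[s] for s in skills if s in per_skill]
        if rated:
            out[ability] = sum(rated) / len(rated)
    return out
```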

The style-vs-substance gap is the most practically important finding: open-source models distilled from proprietary models copy Problem Handling and User Alignment (the style dimensions) but fail at Logical Thinking and Background Knowledge (the substance dimensions). This confirms the linked note "Does supervised fine-tuning actually improve reasoning quality?" at a more granular level — imitation learning acquires superficial capabilities while missing the underlying reasoning.
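One way to make that gap concrete is to compare a distilled model's style abilities against its substance abilities. A hedged sketch reusing ability_scores from the block above; the style/substance split follows the note, and the STYLE/SUBSTANCE names are mine:

```python
STYLE = ("Problem Handling", "User Alignment")            # the "style" abilities
SUBSTANCE = ("Logical Thinking", "Background Knowledge")  # the "substance" abilities

def style_substance_gap(per_skill: dict[str, float]) -> float:
    """Mean style-ability score minus mean substance-ability score.
    A large positive value is the imitation-learning signature: the
    model copies surface form without the underlying reasoning."""
    ab = ability_scores(per_skill)
    style = sum(ab[a] for a in STYLE) / len(STYLE)
    substance = sum(ab[a] for a in SUBSTANCE) / len(SUBSTANCE)
    return style - substance
```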

The saturation points have deployment implications: there is little point scaling past ~7B for Metacognition or past ~13B for Insightfulness, since the returns there are flat. But for Logical Correctness and Factuality, scaling continues to help. This means the optimal model size depends on which skills the application requires: a 13B model, at which Insightfulness has already saturated, is sufficient for creative tasks but not for mathematical reasoning, where Logical Correctness keeps improving.
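That sizing rule is mechanical enough to write down: the smallest adequate model is the largest saturation point among the skills the application needs, and any unsaturated skill pushes you to the biggest model you can afford. A sketch under those assumptions; the saturation table is illustrative and only covers skills the note assigns rough saturation behavior:

```python
# Approximate saturation sizes in billions of parameters, from the note.
# None = the skill kept improving at the largest scale discussed (70B),
# so no saturation point is assumed.
SATURATION_B = {
    "Metacognition": 7,
    "Insightfulness": 13,
    "Readability": 7,   # ASSUMPTION: "saturates early" per the summary; exact size not given
    "Logical Correctness": None,
    "Factuality": None,
}

def min_model_size(required_skills: list[str], max_b: int = 70) -> int:
    """Smallest model size (B params) covering every required skill.
    Skills with no known saturation point (or absent from the table)
    force the maximum size."""
    sizes = [SATURATION_B.get(skill) for skill in required_skills]
    if not sizes or any(s is None for s in sizes):
        return max_b
    return max(sizes)

# e.g. min_model_size(["Insightfulness", "Metacognition"]) -> 13
#      min_model_size(["Logical Correctness"])             -> 70
```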

The FLASK-HARD subset reveals another pattern: even state-of-the-art proprietary models show up to 50% performance degradation on hard instances for some skills. The difficulty dimension interacts with skill type — some skills degrade gracefully with difficulty (Readability), others collapse (Logical Robustness).
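Measuring that interaction is straightforward once you have per-skill scores on both the full benchmark and the hard subset. A minimal sketch with hypothetical inputs; a value of 0.5 corresponds to the 50% degradation cited above:

```python
def hard_degradation(full: dict[str, float], hard: dict[str, float]) -> dict[str, float]:
    """Fractional per-skill score drop on the hard subset relative to
    the full set (0.5 = a 50% degradation). Skills that degrade
    gracefully stay near 0; skills that collapse approach 1."""
    return {
        skill: (full[skill] - hard[skill]) / full[skill]
        for skill in full
        if skill in hard and full[skill] > 0
    }
```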


Source: Self Refinement Self Consistency Feedback — FLASK (arXiv:2307.10928)


evaluation skills scale differently with model size — logical reasoning improves rapidly, while metacognition and readability saturate early, and style imitation masks capability gaps