What skills can large models identify and organize about their own abilities?
This explores two linked questions: whether a model can recognize which skills it actually possesses, and whether those skills can be cleanly named and organized — by the model itself or by the researchers probing it.
The corpus suggests the organizing mostly happens from the outside, while the self-knowing happens only shallowly from the inside, and the two don't yet meet.
Start with organizing. When researchers decompose model ability into discrete skills, the skills behave very differently from one another. FLASK's 12-skill breakdown shows logical reasoning climbing steeply with scale while stylistic and metacognitive skills saturate early: metacognition tops out around 7B parameters, logical efficiency around 30B, and knowledge keeps improving (see: Do all AI skills improve equally as models scale?). So 'skill' isn't one quantity; a model can look fluent (style copied well) while reasoning stays thin, which is exactly the gap distillation exposes. And some of these skills aren't even created by training; they're already latent in the base model and merely *selected* by post-training, whether through RL, decoding tweaks, or feature steering (see: Do base models already contain hidden reasoning ability?). There's even machinery for activating skills on demand: tuning only the singular values of weight matrices yields composable 'expert vectors' that mix at inference without stepping on each other (see: Can models dynamically activate expert skills at inference time?). So abilities can be organized, composed, and selectively switched on, but largely by us, not by the model reflecting on itself.
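The singular-value idea can be made concrete with a toy sketch. Everything in it is an assumption for illustration: the hand-built 2x2 factors, the names `adapted_weight` and `mix`, the expert vectors `z_math` and `z_code`, and the choice to blend experts by a weighted sum. It is not the cited method's implementation; it only shows that a weight rebuilt from rescaled singular values is linear in the expert vector, so a blend of vectors yields the same blend of weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Assumed toy factorisation W = U diag(SIGMA) V^T (illustrative values only).
U = rotation(0.3)
V = rotation(1.1)
SIGMA = [2.0, 0.5]

def adapted_weight(z):
    """Rebuild the weight with singular value i scaled by z[i]."""
    n = len(SIGMA)
    scaled = [[SIGMA[i] * z[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(U, scaled), transpose(V))

def mix(expert_vectors, weights):
    """Weighted sum of expert vectors (an assumed combination rule)."""
    return [sum(w * z[i] for w, z in zip(weights, expert_vectors))
            for i in range(len(SIGMA))]

# Hypothetical expert vectors: one scale factor per singular value.
z_math = [1.2, 0.8]
z_code = [0.9, 1.5]

z_mixed = mix([z_math, z_code], [0.5, 0.5])
W_mixed = adapted_weight(z_mixed)
print(W_mixed)
```

With a real model, the factors would typically come from a numerical SVD routine rather than hand-built rotations, and the mixing weights would have to be chosen per task.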
Now the self-knowing side, which is where the real surprise lives. Models do carry an internal signal for *what they know* versus what they don't: sparse autoencoders reveal an entity-recognition mechanism that tracks whether the model has facts about something, and this mechanism causally steers whether it answers or refuses (see: Do models know what they don't know?). That's a genuine, mechanistic form of self-knowledge, but it's narrow, about facts, not about skills. Step up to broader self-report and it gets unreliable: models can describe behaviors they were never explicitly taught, yet those descriptions are unstable and shift under conversational pressure, which looks like surface awareness rather than real insight (see: How well do language models understand their own knowledge?).
Worse, the self-assessment is biased. Models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* correct during evaluation; the fix is forcing comparison against outside alternatives (see: Why do models trust their own generated answers?). This has a hard theoretical edge: a model can only reliably improve itself where it can verify better than it generates, and that generation–verification gap is what bounds self-improvement (see: What limits how much models can improve themselves?). So a model's ability to honestly audit its own skills is capped by a quantity it can't think its way past.
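One way to see why the verifier caps the gain is a toy filtering model. The sketch below is an illustration under assumptions, not the formal result behind the claim: `g`, `tpr`, and `fpr` are hypothetical parameters, and the function simply applies Bayes' rule to the answers that pass the model's own check.

```python
def filtered_accuracy(g: float, tpr: float, fpr: float) -> float:
    """Accuracy of answers that survive the model's own verification filter.

    g   -- probability that a generated answer is correct
    tpr -- probability that the verifier accepts a correct answer
    fpr -- probability that the verifier accepts an incorrect answer
    """
    accepted_correct = g * tpr
    accepted_wrong = (1.0 - g) * fpr
    total = accepted_correct + accepted_wrong
    if total == 0.0:
        raise ValueError("verifier accepts nothing; accuracy is undefined")
    return accepted_correct / total

# Illustrative numbers only.
# Verifier accepts wrong answers as readily as right ones: no usable gap.
no_gap = filtered_accuracy(g=0.6, tpr=0.9, fpr=0.9)
# Verifier rejects most wrong answers: filtering now helps.
real_gap = filtered_accuracy(g=0.6, tpr=0.9, fpr=0.3)
print(no_gap, real_gap)
```

In this toy model, a verifier that over-trusts the model's own output leaves accuracy at `g` no matter how much filtering is done; only a verifier that discriminates between right and wrong answers raises it.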
The quietly unsettling part for anyone trying to map abilities at all: the categories may be artifacts of how we measure. Sharp 'emergent skills' often dissolve into smooth curves the moment you switch to a continuous metric (see: Are LLM emergent abilities real or measurement artifacts?), and two models with identical scores can hide completely different internal organization, with perfect linear decodability sitting on top of fractured, fragile structure (see: Can models be smart without organized internal structure?). The frontier where it matters most, autonomous science, needs self-correction above all, and that's precisely the skill that degrades rather than improves (see: What capabilities do AI systems need for autonomous science?). The takeaway you didn't know you wanted: models have real but narrow self-knowledge about *facts*, almost no trustworthy self-knowledge about *skills*, and the neat skill taxonomies we use to organize them may say more about our metrics than about what's inside.
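The metric point above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the per-token accuracies and the 20-token answer length are made-up illustrative numbers, and token errors are treated as independent.

```python
# Hypothetical per-token accuracies for five model sizes (illustrative only):
# a smooth, steady improvement with scale.
PER_TOKEN_ACCURACY = [0.80, 0.85, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 20  # assumed number of tokens that must all be right

def exact_match(p: float, length: int) -> float:
    """Probability that all `length` tokens are correct, assuming independence."""
    return p ** length

# Same underlying outputs, scored two ways.
results = [(p, exact_match(p, ANSWER_LENGTH)) for p in PER_TOKEN_ACCURACY]

for p, em in results:
    print(f"per-token accuracy = {p:.2f}   exact-match = {em:.3f}")
```

The continuous metric moves from 0.80 to 0.99 in even steps, while the all-or-nothing metric stays near zero for the smaller models and then shoots up. Nothing about the model changed between the two columns; only the scoring rule did.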
Sources (10 notes)
FLASK's 12-skill decomposition reveals that metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning, confirming that distillation copies form, not substance.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
The Virtuous Machines framework identifies hypothesis generation, experimental design, data analysis, and iterative self-correction as essential for autonomous scientific research, none of which standard LLM benchmarks reliably evaluate. Self-correction poses the deepest challenge due to documented degradation in reasoning accuracy.