How well do language models understand their own knowledge?

How LLM self-awareness and confidence affect user reliance and interaction outcomes.

Topic Hub · 25 linked notes · 4 sections

Self-Knowledge and Behavioral Awareness

7 notes

Can language models describe their own learned behaviors?

Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.

Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or merely reproduce training-distribution patterns.

Do users worldwide trust confident AI outputs even when wrong?

Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.

Why do people trust AI outputs they shouldn't?

When do human cognitive shortcuts fail in AI interaction? Three compounding traps—treating statistical patterns as facts, mistaking fluency for understanding, and avoiding disagreement—may explain systematic overreliance across languages and contexts.

Can a coordination layer turn LLM patterns into genuine reasoning?

LLMs excel at pattern retrieval but lack external constraint binding. Can a System 2 coordination layer—anchoring outputs to goals and evidence—transform statistical associations into goal-directed reasoning?

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Can cognition work by reusing memory instead of recomputing?

Does intelligence emerge from structured navigation of prior inference paths rather than fresh computation? This challenges whether brains and AI systems need to recalculate constantly or can leverage stored trajectories for efficiency.

Context, Prompting, and Interaction

8 notes

Why do LLM persona prompts produce inconsistent outputs across runs?

Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Can models abandon correct beliefs under conversational pressure?

Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.

Does reasoning fine-tuning make models worse at declining to answer?

When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.

How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Can longer task training help shorter tasks extrapolate?

When models train on related tasks at different lengths, does learning a longer auxiliary task enable the model to generalize on a shorter main task beyond that task's training length? This matters for understanding how neural networks transfer learned capabilities across related problems.

How does thinking emerge from policy selection in RL?

Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.

How much does the user shape what a model generates?

Prompt engineering is often framed as unlocking hidden capabilities, but what if users are actually imposing their own expectations onto model output? This explores whether refinement is discovery or confirmation.

Authorship, Metacognition, and the LLM Fallacy

3 notes

How does AI-assisted work reshape how people see their own abilities?

When users delegate tasks to AI, do they unknowingly integrate the system's outputs into their sense of personal competence? This explores whether AI interaction produces a specific form of self-perception distortion distinct from trust or effort issues.

Does processing ease mislead users about their own competence?

When AI generates polished output, do users mistake the fluency of that output for evidence of their own understanding or skill? This matters because it could systematically inflate self-assessment across millions of AI interactions.

Do users truly own the AI-generated content they produce?

When people use AI to create outputs, do they experience genuine authorship and ownership of what's produced, or does the continuous interaction loop create a gap between what they feel and what they claim?