Design & LLM Interaction · Language Understanding and Pragmatics

Can AI generate assessment questions as good as human experts?

This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.

Note · 2026-02-23 · sourced from Psychology Users

A rigorous psychometric evaluation comparing ChatGPT-generated formative assessment questions to published Creative Commons textbook questions finds no statistically significant differences on the properties that matter for measurement quality.

Using Item Response Theory (IRT) with a linking methodology to ensure comparability, the study (N=207) tested 15 ChatGPT-generated items against 15 human-authored items drawn from the same lesson content. Neither the difficulty nor the discrimination parameters differed significantly between the two item sets.

This is notable because psychometric quality is a higher bar than surface-level plausibility. Difficulty and discrimination are the core parameters in educational measurement: they determine whether a question is appropriately challenging and whether it distinguishes students who understand the material from those who don't. Matching human experts on these parameters means the generated items are functionally equivalent for formative assessment purposes.
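The study does not publish its estimation code, but the two parameters it compares are defined by the two-parameter logistic (2PL) IRT model. A minimal sketch of that model, with hypothetical item values chosen only for illustration:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Item characteristic curve for the 2PL IRT model.

    theta : examinee ability
    a     : discrimination (slope of the curve at theta == b)
    b     : difficulty (ability level where P(correct) == 0.5)
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items with equal difficulty but different discrimination.
theta = np.linspace(-3, 3, 7)
flat  = icc_2pl(theta, a=0.5, b=0.0)  # weakly discriminating item
steep = icc_2pl(theta, a=2.0, b=0.0)  # strongly discriminating item

# Both cross P = 0.5 at theta = 0 (same difficulty) ...
assert abs(icc_2pl(0.0, 0.5, 0.0) - 0.5) < 1e-9
# ... but the steep item separates low- from high-ability examinees better.
```

Comparing fitted a and b values across the two item pools (after linking them onto a common ability scale) is what lets the study claim the generated and human-authored items are statistically indistinguishable.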

However, the scope is constrained: one lesson summary, one textbook, formative (not summative) assessment. The generalization to diverse subjects, higher-stakes testing, or open-ended question formats remains untested.

The finding connects to a broader pattern in LLM generation quality: as in Can LLMs generate more novel ideas than human experts?, structured generation tasks with clear constraints (like assessment items from a lesson summary) may represent a sweet spot where LLMs match or exceed human quality, while open-ended evaluative tasks remain a weakness.




LLM-generated assessment questions match human-authored questions on psychometric difficulty and discrimination parameters