Can AI generate assessment questions as good as human experts?
This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.
A rigorous psychometric evaluation comparing ChatGPT-generated formative assessment questions to published Creative Commons textbook questions finds no statistically significant differences on the properties that matter for measurement quality.
Using Item Response Theory (IRT) with a linking methodology to ensure comparability, the study (N=207) tested 15 ChatGPT-generated items against 15 human-authored items from the same lesson content. Results (a rough sketch of this kind of comparison appears after the list):
- Difficulty parameters — no significant difference between pools
- Discrimination parameters — no significant difference, with some evidence ChatGPT items were marginally better at differentiating respondent abilities
- Response time — no significant difference
- Unidimensionality — ChatGPT items showed evidence of measuring a single construct and did not disrupt the unidimensionality of the original set when tested together
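As a concrete illustration of what such a comparison looks like in practice, here is a minimal Python sketch. It assumes a two-parameter logistic (2PL) model and uses simulated placeholder parameters and responses; the numbers, the Mann-Whitney comparison, and the eigenvalue screen for unidimensionality are illustrative assumptions, not the study's actual pipeline or results.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Placeholder 2PL item parameters for the two 15-item pools
# (a = discrimination, b = difficulty). In the real study these would be
# estimated from the N=207 response matrix via an IRT calibration with linking.
a_human, b_human = rng.lognormal(0.0, 0.3, 15), rng.normal(0.0, 1.0, 15)
a_gpt,   b_gpt   = rng.lognormal(0.0, 0.3, 15), rng.normal(0.0, 1.0, 15)

def p_correct(theta, a, b):
    """2PL item characteristic curve: P(correct answer | ability theta)."""
    return expit(a * (theta - b))

# Compare the parameter distributions across pools (non-parametric test,
# since each pool has only 15 items).
for name, human, gpt in [("difficulty", b_human, b_gpt),
                         ("discrimination", a_human, a_gpt)]:
    stat, p = mannwhitneyu(human, gpt, alternative="two-sided")
    print(f"{name}: U = {stat:.1f}, p = {p:.3f}")

# Crude unidimensionality screen: simulate responses from a single latent
# trait and check that the first eigenvalue of the inter-item correlation
# matrix dominates the rest (a scree-style check; the paper's test may differ).
theta = rng.normal(0.0, 1.0, 207)
a_all = np.concatenate([a_human, a_gpt])
b_all = np.concatenate([b_human, b_gpt])
responses = rng.random((207, 30)) < p_correct(theta[:, None], a_all, b_all)
eigvals = np.linalg.eigvalsh(np.corrcoef(responses.astype(float), rowvar=False))
print("top eigenvalues:", np.sort(eigvals)[::-1][:3].round(2))
```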
This is notable because psychometric quality is a higher bar than surface-level plausibility. Difficulty and discrimination are the core parameters in educational measurement — they determine whether a question is appropriately challenging and whether it distinguishes students who understand the material from those who don't. Matching human experts on these parameters means the generation is functionally equivalent for formative assessment purposes.
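For reference, these two parameters have precise definitions in the item response function. Under the common two-parameter logistic (2PL) model (the note does not state which IRT model the study fit, so treat this as the standard textbook formulation rather than the paper's specification), the probability that respondent $j$ answers item $i$ correctly is

$$P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $\theta_j$ is the respondent's ability, $b_i$ is the item's difficulty (the ability level at which the chance of a correct answer is 50%), and $a_i$ is its discrimination (how steeply that probability rises around $b_i$).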
However, the scope is constrained: one lesson summary, one textbook, and formative (not summative) assessment. Whether the result generalizes to diverse subjects, higher-stakes testing, or open-ended question formats remains untested.
The finding connects to a broader pattern in LLM generation quality. Read alongside "Can LLMs generate more novel ideas than human experts?", it suggests that structured generation tasks with clear constraints (like writing assessment items from a lesson summary) may be a sweet spot where LLMs match or exceed human quality, while open-ended evaluative tasks remain a weakness.
Source: Psychology Users
Related concepts in this collection
- Can LLMs generate more novel ideas than human experts? Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction? Relation to this note: assessment generation as a structured domain where LLM generation parity holds, in contrast with the open-ended evaluation weakness.
Original note title: LLM-generated assessment questions match human-authored questions on psychometric difficulty and discrimination parameters