Evaluating the psychometric properties of ChatGPT-generated questions

Little is known about how LLM-generated questions compare to gold-standard, traditional formative assessments in terms of difficulty and discrimination, two item parameters valued in psychometric measurement. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. We collected and analyzed responses from 207 test respondents who answered questions from both item pools, and used a linking methodology to compare item response theory (IRT) properties between the two pools. We find no statistically significant difference in either the difficulty or the discrimination parameters of the 15 items in each pool, with some evidence that the ChatGPT items were marginally better at differentiating between respondents of different ability. Response time also does not differ significantly between the two item sources. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original item set when the two were tested together.
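
As a rough illustration of what the difficulty and discrimination parameters mean, the sketch below evaluates a two-parameter logistic (2PL) item response function. The abstract does not say which IRT model was fitted, so the 2PL form is an assumption here, and the function names and item values are hypothetical rather than taken from the study.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function (assumed model, not confirmed by the abstract).

    theta: respondent ability
    a:     item discrimination (slope)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items with equal difficulty but different discrimination.
low_disc = {"a": 0.5, "b": 0.0}
high_disc = {"a": 2.0, "b": 0.0}


def spread(item, lo=-1.0, hi=1.0):
    """Gap in P(correct) between a higher- and a lower-ability respondent."""
    return p_correct(hi, item["a"], item["b"]) - p_correct(lo, item["a"], item["b"])


if __name__ == "__main__":
    for name, item in (("low_disc", low_disc), ("high_disc", high_disc)):
        print(name, round(spread(item), 3))
```

Under this assumed model, an item with a larger `a` separates lower- and higher-ability respondents more sharply, which is the sense in which one item pool can be "better at differentiating" than another.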