Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Paper · arXiv 2405.02985 · Published May 5, 2024
Education

We found that GPT-4, with basic few-shot prompting, performed well (Kappa of 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short-answer reading comprehension questions at a performance level very close to that of expert human raters. The proximity to human-level performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education, and it has important implications for real-world education delivery.
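The agreement figures above are reported as Kappa statistics comparing LLM-assigned marks with human marks. As a minimal sketch of how such agreement might be computed, the snippet below uses Cohen's kappa from scikit-learn; the 0-2 mark scale, the example scores, and the choice of quadratic weighting are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: measuring rater agreement between LLM-assigned and
# human-assigned short-answer marks with Cohen's kappa. The scores and
# the quadratic weighting below are assumptions for illustration only.
from sklearn.metrics import cohen_kappa_score

# Hypothetical marks on a 0-2 scale from a human rater and an LLM.
human_scores = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]
llm_scores   = [2, 1, 1, 2, 2, 1, 0, 2, 2, 0]

# Quadratic weighting penalises larger disagreements more heavily;
# whether the paper used weighted or unweighted kappa is not stated here.
kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Cohen's kappa (quadratic weights): {kappa:.2f}")
```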

Open-ended and short-answer questions require the student to answer a question in their own words in a few sentences [18]. Many researchers argue that they reduce the influence of test-taking strategies, have greater face validity, carry a lower risk of floor effects, and may be better suited to evaluating certain subprocesses of the skill being assessed [3, 8], and that for many formative assessment tasks they may be preferable [4, 21]. However, grading open-ended questions can be resource-intensive and expensive, which limits their widespread use [15].

These models often struggled with domain shift when deployed in educational contexts [5, 19]. Mitigating domain shift can be particularly challenging because even tasks that appear similar on the surface may have subtle differences that are not immediately evident but can greatly affect a model's performance [9, 22].