Centre for Scholarship and Innovation
This was a cross-faculty project funded by NCFE (Northern Council for Further Education).
This study addresses the challenges and opportunities posed by generative artificial intelligence (GAI) tools, such as ChatGPT, for assessment in further and higher education. The research had three objectives: to identify the assessment question types that GAI was most and least capable of answering; to investigate whether this varied across disciplines and levels; and to investigate markers’ ability to detect GAI scripts and whether training could improve that detection.
The research involved selecting 59 questions representative of 17 different question types (such as essays, short answers, numerical answers, reflections on work practice or skills, answers tailored to specific audiences, and alternative output formats). The questions were drawn from 17 disciplines and four levels, ranging from Access to third-year undergraduate. For each question, a marker (43 markers in total) marked a first batch of eight scripts, awarding a grade, stating whether they suspected any scripts were AI-generated and, if so, giving their reasons. Markers then completed a short, online training course and marked a second batch of different scripts for the same question. Markers were not told how many GAI scripts were in each batch; in fact, each batch contained five student scripts and three GAI scripts. The marking records, the results of quizzes and a survey within the training, and transcripts from focus groups held after the second batch of marking were analysed using mixed (quantitative and qualitative) methods.
Results showed that GAI achieved a pass in all question types except one. The exception was the Activity plan, a consultancy role-play question in which students had to apply tools from the module, the marking guidance was specific, and that guidance was applied rigorously by the marker. GAI performed less well as the level of study increased, and this pattern held across most disciplines. Training considerably improved markers’ ability to detect GAI scripts, but it also produced a large increase in false positives (student scripts incorrectly classified by markers as AI-generated). This applied across the majority of question types. Markers reported that the training increased their confidence and their ability to articulate their ‘gut feeling’ that a script was AI-generated, but they started to see the hallmarks of GAI everywhere, including in the student scripts. Trying to decide whether scripts were AI-generated also substantially increased markers’ workloads. Combining the results for GAI performance with the likelihood of detection without a substantial increase in false positives, the more robust question types were Activity plan, Audience-tailored, Observation by learner and Reflection on work practice. These broadly align with what is often called ‘authentic assessment’.
The findings suggest that the hallmarks of GAI are not a feasible guide to detecting AI-generated scripts, because of the rise in false positives and the increased workload. However, those hallmarks do point to ways in which assessment questions and the associated marking guidance may be designed to be more robust, reducing GAI’s ability to achieve good passes. The report also makes recommendations regarding skills interventions, teaching critical AI literacy, and academic conduct systems.