Reliability: Marking vs Judging

Chris Wheadon
The No More Marking Blog
Dec 20, 2016


I recently blogged about judging, rather than marking, GCSE English mock exams. Kieron Bailey, Head of Faculty at Parker E-Act Academy, asked his Department to judge the long essay question (30 marks) for the AQA GCSE English Literature Paper 1 mock, and was kind enough to share his data with me.

In my post encouraging schools to judge, I suggested that the marking reliability of 24-mark questions was likely to be +/- 6 marks, according to the marking reliability research published by Ofqual.

How did the school's judging compare with what we would expect from marking?

The school judged the work of 210 pupils, using 10 judges who completed 120 judgements each: 1,200 judgements in total, with each script judged 11 or 12 times. The judgements were done over several sittings, but it would seem that, on average, each judge took around an hour to complete their 120 judgements.
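
As a quick sanity check on those numbers, here is a small sketch (my own, not part of the original post): each paired judgement involves two scripts, so 1,200 judgements give 2,400 script appearances spread across 210 scripts, or roughly 11 or 12 per script.

```python
# Quick sanity check on the judging plan (illustrative only).
judges = 10
judgements_per_judge = 120
scripts = 210

total_judgements = judges * judgements_per_judge   # 1,200 judgements
appearances = 2 * total_judgements                  # each judgement compares 2 scripts
per_script = appearances / scripts                  # ~11.4, i.e. 11 or 12 per script

print(total_judgements, round(per_script, 1))
```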

On completion, the Head of Department looked through the results and decided that the best piece of work was worth 23 marks and the worst 5 marks. That gives us a range of 18 marks.

The average Standard Error of the True Score was 0.98 logits. The True Score range was -6.54 logits to 5.35 logits, a range of 11.89 logits.

We then transform the true scores onto the scaled mark range using the following method.

WANTED RANGE = range of scaled scores wanted = 18
RANGE = current range of true scores = 11.89
USCALE = WANTED RANGE / RANGE = 18 / 11.89 ≈ 1.51
SCALED SCORE SE = True Score SE * USCALE = 0.98 * 1.51 ≈ 1.49
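
A minimal sketch of that calculation in Python (the variable names are mine, not part of the original post):

```python
# Rescale the true-score standard error from logits onto the mark range.
wanted_range = 23 - 5               # range of scaled marks set by the Head of Department
true_score_range = 5.35 - (-6.54)   # range of true scores, in logits
true_score_se = 0.98                # average standard error of the true score, in logits

uscale = wanted_range / true_score_range   # ~1.51 marks per logit
scaled_se = true_score_se * uscale         # ~1.48 marks (the post reports 1.49,
                                           # presumably from unrounded values)

print(round(uscale, 2), round(scaled_se, 2))
```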

We can estimate that the judged scores are correct to +/- 1.49 marks.
We would expect traditional marking to be correct to +/- 6 marks.


The judging by this school certainly seems to have produced a more reliable outcome than marking the scripts would have done. Had the school completed more judgements per script, the reliability of the judging could have been higher still.
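
As a rough illustration of that last point, here is my own sketch, assuming the standard error shrinks roughly with the square root of the number of judgements per script (a common rule of thumb, not a result from the original post):

```python
import math

# Rough projection: if the SE shrinks roughly with the square root of the number of
# judgements per script (an assumption, not data from the post), more judgements
# would narrow the +/- band further.
current_se_marks = 1.49
current_judgements_per_script = 11.5   # roughly 11 or 12 in the school's data

for per_script in (11.5, 20, 30):
    projected_se = current_se_marks * math.sqrt(current_judgements_per_script / per_script)
    print(per_script, round(projected_se, 2))
```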
