Rater Monitoring with Inter-Rater Reliability May Not Be Enough for Next-Generation Assessments
Testing experts know a great deal about how to score students’ written responses to assessment items. Raters are trained under strict protocols to apply scoring rules accurately and consistently. To verify that raters have done their job well, we rely on a few basic score quality measures that center on how closely two or more raters agree. These measures of agreement are called inter-rater reliability (IRR) statistics, and they are widely used, perhaps in part because they are easy to understand and apply.
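To make the idea of agreement concrete, here is a minimal sketch of two common IRR statistics for a pair of raters: exact agreement (the share of responses given identical scores) and Cohen's kappa (agreement corrected for chance). The scores and function names below are invented for illustration and do not come from any particular scoring program.

```python
from collections import Counter


def exact_agreement(scores_a, scores_b):
    """Proportion of responses on which the two raters gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: observed agreement adjusted for agreement expected by chance."""
    n = len(scores_a)
    p_observed = exact_agreement(scores_a, scores_b)
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    # Chance agreement: for each score point, multiply the two raters' marginal
    # proportions, then sum over score points.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical score point; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical scores (0-2 rubric) from two raters on the same eight responses.
rater_a = [0, 1, 2, 2, 1, 0, 2, 1]
rater_b = [0, 1, 2, 1, 1, 0, 2, 2]

print(f"Exact agreement: {exact_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(rater_a, rater_b):.2f}")
```

In this made-up example the raters agree on six of eight responses (0.75), but kappa is lower (about 0.62) because some of that agreement would be expected by chance alone.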