Rater Monitoring with Inter-Rater Reliability May Not Be Enough for Next-Generation Assessments
Testing experts know a great deal about how to score students’ written responses to assessment items. Raters are trained under strict protocols to follow scoring rules accurately and consistently. To verify that raters did their job well, we use a few basic score quality measures that center on how well two or more raters agree. These measures of agreement are called inter-rater reliability (IRR) statistics, and they are widely used, perhaps in part because they are easy to understand and apply.
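For readers less familiar with IRR statistics, the two most common flavors, exact agreement and Cohen’s kappa, take only a few lines to compute. This is an illustrative sketch with made-up ratings on a hypothetical 0–3 rubric, not production scoring code:

```python
from collections import Counter

def exact_agreement(r1, r2):
    """Proportion of responses on which two raters gave the same score."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_obs = exact_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: probability both raters assign the same score
    # if each scored independently at their own marginal rates.
    categories = set(r1) | set(r2)
    p_chance = sum((c1[k] / n) * (c2[k] / n) for k in categories)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical ratings on a 0-3 rubric for ten responses
rater_a = [2, 3, 1, 0, 2, 2, 3, 1, 2, 0]
rater_b = [2, 3, 1, 1, 2, 2, 3, 2, 2, 0]

print(exact_agreement(rater_a, rater_b))          # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))   # 0.714
```

Note that both statistics describe how well raters agree with each other, not how accurate either rater is, which is the limitation discussed below.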
These well-established procedures have allowed us to produce defensible scores for tests with many multiple-choice items and few constructed-response items. But the truth is, all raters are at least a little inaccurate, and for practical reasons we have accepted that reality.
When a test has only a few constructed-response items, the impact of rater inaccuracies on test scores is likely to be small because their influence on the total score is diluted. In contrast, when a test has a greater number of constructed-response items, rater inaccuracies have a larger overall impact. This raises the question: how can we mitigate the accumulating impact of rater inaccuracies on student scores?
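A quick simulation illustrates the accumulation. Assuming, purely for illustration, that each constructed-response item score carries independent rater error with a standard deviation of half a score point, the rater-induced error in the total test score grows with the number of such items:

```python
import random

random.seed(7)

def total_score_error_sd(n_cr_items, rater_error_sd=0.5, n_sims=20000):
    """Monte Carlo estimate of the SD of rater-induced error in a total
    score, assuming independent error on each constructed-response item."""
    errors = []
    for _ in range(n_sims):
        e = sum(random.gauss(0, rater_error_sd) for _ in range(n_cr_items))
        errors.append(e)
    mean = sum(errors) / n_sims
    var = sum((e - mean) ** 2 for e in errors) / n_sims
    return var ** 0.5

# Error SD roughly follows 0.5 * sqrt(k): more CR items, more total error
for k in (2, 8, 20):
    print(k, round(total_score_error_sd(k), 2))
```

Under these independence assumptions the error standard deviation grows with the square root of the number of constructed-response items, so a test built around many such items carries substantially more rater-induced noise in its total scores.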
Mitigating the Presence and Impact of Rater Inaccuracies
There are no easy solutions to this challenge, but there are three distinct opportunities to mitigate rater inaccuracies:
- during item and rubric development,
- preceding and during operational scoring, and
- after operational scoring.
Before Operational Scoring: Item and Rubric Development
Specific scoring criteria can effectively reduce rater inaccuracies. Rubrics that explicitly describe the attributes required in student responses provide important clarity for raters. When rubrics rely on indeterminate language to differentiate levels of student performance, raters have a more difficult time distinguishing between those levels. When concrete, explicit language or exemplar responses are provided instead, personal bias is less likely to enter the scoring process. Of course, we want to be sure the revised rubric still reflects the purpose and rigor of the item.
Leacock, Gonzalez, and Conarroe (2014) provide an example in which a relatively simple change to a rubric resulted in a large improvement in score quality without sacrificing score validity. The original rubric defined a score of 3 as, “Response includes a thorough and accurate explanation with at least two details from the passage that clearly support the argument.” The revised rubric changed this definition to, “Response includes the required concept and provides two supporting details” (p. 6). Changes of this type were shown to produce improvements of up to 30% in rater agreement outcomes.
During Operational Scoring: Training and Monitoring
The second window of opportunity to mitigate the impact of rater inaccuracy is to control it directly during operational training and scoring. One common approach is to seed expert-scored responses into the pool of examinee responses. Ratings on these seeded responses are then evaluated for their agreement with the experts’ scores, and decisions are made about whether to retrain individual raters, rescore their work, or even dismiss them when their performance does not meet a predetermined threshold.
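A minimal sketch of this seeding-and-monitoring logic follows. The thresholds, rater names, and scores are all hypothetical; operational programs set their own decision rules:

```python
def validity_agreement(rater_scores, expert_scores):
    """Proportion of seeded (expert-scored) responses the rater matched exactly."""
    matches = sum(r == e for r, e in zip(rater_scores, expert_scores))
    return matches / len(expert_scores)

def monitor(raters, expert_scores, retrain_below=0.80, dismiss_below=0.60):
    """Hypothetical decision rule: flag raters whose agreement with the
    expert scores on seeded responses falls below preset thresholds."""
    actions = {}
    for name, scores in raters.items():
        agree = validity_agreement(scores, expert_scores)
        if agree < dismiss_below:
            actions[name] = ("review for dismissal/rescoring", agree)
        elif agree < retrain_below:
            actions[name] = ("retrain", agree)
        else:
            actions[name] = ("ok", agree)
    return actions

# Made-up expert scores on ten seeded responses, 0-3 rubric
expert = [2, 3, 1, 0, 2, 2, 3, 1, 2, 0]
raters = {
    "R1": [2, 3, 1, 0, 2, 2, 3, 1, 2, 0],   # 100% agreement -> ok
    "R2": [2, 3, 2, 0, 2, 1, 3, 0, 2, 0],   # 70% -> retrain
    "R3": [1, 2, 2, 1, 2, 1, 3, 2, 2, 1],   # 30% -> review
}
print(monitor(raters, expert))
```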
After Operational Scoring: Statistical Corrections
The third and last opportunity to mitigate rater inaccuracy on final examinee scores is to use specialized statistical procedures to detect and correct it. There are two general approaches that might be considered.
One approach is to adjust scores based on measures of rater drift. The other is a collection of methods (e.g. Linacre, 1989; Verhelst & Verstralen, 2001; Wilson & Hoskens, 2001; Patz, Junker, Johnson, & Mariano, 2002; and Casabianca, Junker, & Patz, 2016) designed to quantify rater inconsistency and inaccuracy, and identify the most accurate score for each examinee’s response. Rater drift supports decisions about final scores relative to past scores, whereas rater models support decisions based on direct measures of rater inconsistency and inaccuracy.
- Adjusting for Drift. Conceptually, drift adjustments are simple. Examinee responses are scored, and then a representative sample is rescored by a different set of raters in a different administration of the same item. Any difference in scores between the two rater groups is called drift, and the size of the difference (e.g., the difference in average scores between the two rater groups) determines the size of the adjustment to scores in the second administration of the item. Since the goal is only to make scores between rater groups comparable, it is typically irrelevant whether we are adjusting from lower to higher score quality, or vice versa.
- Rater Modeling. Rater models quantify rater error directly, so their utility is distinguished from IRR and drift adjustments by their potential to produce item scores for each response that account for rater inaccuracy. In practice, these methods require additional empirical research and are more complex than IRR. However, they do offer a direct means to mitigate the impact of accumulating rater inaccuracies as we construct tests with more constructed response items. As we simultaneously increase our reliance on automated raters to produce a single score of record, this type of approach might also provide a more solid basis for improving the quality of automatically produced scores as well.
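A toy numeric sketch of the mean-shift drift adjustment described above (the 0–4 scale and all scores are made up, and real programs may apply more refined adjustments):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical 0-4 item scores on the SAME sample of year-1 responses,
# scored once by each rater group.
year1_raters = [3, 2, 4, 1, 3, 2, 3, 4, 2, 3]   # original scores
year2_raters = [3, 3, 4, 2, 3, 2, 4, 4, 2, 3]   # rescore by year-2 raters

# Drift: how much higher (or lower) the year-2 group scores the same work
drift = mean(year2_raters) - mean(year1_raters)   # +0.3 here

# Apply the adjustment to year-2 operational scores to put them on the
# year-1 group's scale (rounding back to rubric points if required).
year2_operational = [2, 4, 3, 1, 3]
adjusted = [s - drift for s in year2_operational]
print(round(drift, 2), [round(a, 1) for a in adjusted])
```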
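The core idea the rater models cited above share can be illustrated with a deliberately simplified sketch: if each rater’s error tendencies are summarized as a confusion matrix P(observed score | true score), the most plausible true score for a response can be inferred from the raters’ observed scores. A uniform prior and conditionally independent raters are assumed here purely for illustration; the published models are considerably richer:

```python
def posterior_true_score(observed, error_matrices, n_categories):
    """Posterior over the true item score, given each rater's observed score.
    Assumes a uniform prior over true scores and conditionally independent
    raters. error_matrices[r][t][o] = P(rater r reports o | true score t)."""
    post = []
    for t in range(n_categories):
        likelihood = 1.0
        for r, o in enumerate(observed):
            likelihood *= error_matrices[r][t][o]
        post.append(likelihood)
    total = sum(post)
    return [p / total for p in post]

# Hypothetical confusion matrices on a 0-2 rubric (rows: true, cols: observed)
accurate = [[0.80, 0.15, 0.05],
            [0.10, 0.80, 0.10],
            [0.05, 0.15, 0.80]]
severe = [[0.90, 0.08, 0.02],    # this rater tends to score low
          [0.50, 0.45, 0.05],
          [0.10, 0.50, 0.40]]

# The accurate rater observed a 2; the severe rater observed a 1.
post = posterior_true_score([2, 1], [accurate, severe], 3)
print([round(p, 3) for p in post])   # the true score is most plausibly 2
```

Because the severe rater’s tendency to under-score is modeled explicitly, the disagreement is resolved toward the accurate rater rather than split down the middle, which is exactly the kind of correction IRR statistics alone cannot provide.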
The Bottom Line
Whether one or all items on a test require written responses, IRR indirectly measures their score quality one item at a time. Agreement statistics cannot be used to establish the amount or type of rater inaccuracy that is present in item scoring, and therefore its potential impact on total test scores cannot be fully characterized or controlled without the application of additional methods or statistical models.
Finally, the agreement thresholds that are used in practice to defend score quality may leave room for more rater inaccuracy than we are willing to accept as the number of constructed response items increases. I propose that it is time to consider a more comprehensive view of controlling score quality and recommend using a multistage approach that attends to mitigating rater inaccuracies before, during, and after operational test scoring.
Casabianca, J.M., Junker, B.W., & Patz, R.J. (2016). Hierarchical rater models. In W.J. van der Linden (Ed.), Handbook of Item Response Theory Volume One: Models. Boca Raton, FL: CRC Press.
Leacock, C., Gonzalez, E., & Conarroe, M. (2014). Developing effective scoring rubrics for AI short answer scoring. McGraw-Hill Education CTB Innovative Research and Development Grant. Monterey, CA: McGraw-Hill Education CTB.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
Patz, R.J., Junker, B.W., Johnson, M.S., & Mariano, L.T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341-384.
Verhelst, N. D., & Verstralen, H. H. F. M. (2001). An IRT model for multiple raters. In A. Boomsma, M. A. J. Van Duijn, and T. A. B. Snijders (Eds.), Essays on item response modeling. New York, NY: Springer-Verlag.
Wilson, M., & Hoskens, M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26, 283-306.