Expectations for Automated Scoring | Center for Assessment

Using Advances in Technology to Improve the Quality of Educational Assessment

Earlier this month, the Center for Assessment held its 15th annual colloquium, the first named in honor of Center co-founder, Brian Gong. The Brian Gong Colloquium is a two-day meeting arranged by the Center for Assessment to discuss select topics of importance in testing and accountability with recognized experts.

The 2019 meeting focused on the present and future use of learning analytics, machine learning, and artificial intelligence in educational assessment, and highlighted topics including:

The current state of the art in automated scoring, item generation, and text authoring
Speech recognition technologies and applications
Scenario-based simulations

We discussed the current applications, utility, promises, validity (of course), and limitations of these technologies. These discussions were rich and informative and provided insight into many potential directions within assessment.

However, for this post, I want to focus on a topic that is of special interest to me, automated scoring.

The Promise of Automated Scoring

The fundamental motivation to build and deploy automated scoring engines has been to maintain and even increase the number of open-response items while avoiding increased scoring time and cost. I propose we consider better scores as another benefit beyond faster and cheaper. My rationale for changing our expectations from parity with human raters to improving on human scoring rests on advances in the state of the art and recent research on the psychometric properties of automatically-produced scores.

There are at least three important ways that automated raters may have the potential to improve the quality and use of open-response scores:

Designing for automated scoring requires more explicit descriptions of the expected response features in scoring rubrics, which has been shown to improve both human and machine scores.
Recent improvements in speech recognition accuracy for young children make it possible to universally assess reading fluency in early grades as an indicator for language disabilities.
Increased scoring consistency has the potential to reduce the amount of rater error typically found in open-response scoring, allowing for more stable results and reduced threats to the comparability of test scores.

The transition from paper-based to computer-based testing has not been smooth due to logistical issues such as insufficient technological infrastructure in schools and measurement concerns such as score comparability across testing modes. Certainly, there are also well-known barriers to increasing the use of automated scoring. These include:

a lack of representative data sets for model training, validation, and research
long development timelines and high costs
data privacy concerns
our ability to demonstrate the validity of automatically-produced scores
and, possibly the greatest barrier to fully realizing the benefits of automated scoring, trust in the scores it generates

A common criticism is that automated raters do not produce scores in the same way human raters do. Unfortunately, we do not fully understand how humans score examinee responses, and we do not really focus on how human raters produce scores. Mitigating threats to human score quality has rested largely on detecting bias and inconsistency among scorers through the traditional inter-rater reliability (IRR) statistics. However, these basic descriptive statistics, which are focused on scoring consistency, can miss some problematic elements of scoring accuracy, and, consequently, we miss the opportunity to intervene with training or rescoring measures.
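To make the distinction between consistency and accuracy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any operational scoring program: the scores are invented, the "reference" scores stand in for a hypothetical expert-consensus score on each response, and the function names are illustrative. It computes two common IRR statistics (exact agreement and unweighted Cohen's kappa) for a pair of raters who share the same lenient tendency.

```python
from collections import Counter


def exact_agreement(a, b):
    """Proportion of responses on which two score lists match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: exact agreement corrected for chance."""
    n = len(a)
    p_observed = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every score category either rater used.
    p_chance = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # degenerate case: both raters used one identical category
    return (p_observed - p_chance) / (1 - p_chance)


# Invented data on a 0-4 rubric (assumption: not drawn from any real study).
reference = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]  # hypothetical expert-consensus scores
rater_1 = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]    # about one point lenient, capped at 4
rater_2 = [1, 2, 2, 3, 3, 2, 4, 4, 4, 4]    # same leniency, one disagreement

print("rater 1 vs rater 2 agreement:", exact_agreement(rater_1, rater_2))
print("rater 1 vs rater 2 kappa:    ", round(cohen_kappa(rater_1, rater_2), 2))
print("rater 1 vs reference:        ", exact_agreement(rater_1, reference))
print("rater 2 vs reference:        ", exact_agreement(rater_2, reference))
```

With these made-up numbers, the two raters agree with each other on 9 of 10 responses (kappa of about 0.86), which would look healthy in a standard IRR report, yet rater 1 matches the reference score on only 2 of 10 responses and rater 2 on 3 of 10. A shared bias of this kind is invisible to statistics that only compare raters with one another.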

How much rater error is acceptable when it comes to the accuracy of scoring student responses? Secondarily, are there automated rater training and validation methods that might provide better scores than traditional scoring practices? The answer is more likely to be yes than no, which means we are really at an exciting point where we have the opportunity to consider where current assessment paradigms will continue to apply, and where they might be improved by technology-based solutions.

Although the ways in which automated scoring differs from human scoring are easy to point out, our responsibility to ensure the reliability and validity of scores remains the same. We must continue to start with clear definitions of what we wish to measure and commit to end with producing evidence that test scores may be interpreted as intended.

Supporting the Effort to Advance Automated Scoring

There is one initial and very important way in which states and large school districts can support the development of advances in the utility of automated scoring and realize the benefits of faster, cheaper, and better scoring – in a word, data. The quality of engine training and the research on its downstream effects on score scales are often constrained by the complexity and high cost of acquiring samples of items and student responses. Lack of data is a substantial impediment to advancing the utility of automated scoring across more subjects and item types. Scoring engines can become more useful and produce more valid score interpretations if more data for a wider range of item types become available to developers and researchers.

There are also areas of research in which the field can engage that might move automated scoring beyond its current limitations. In my next post, I will summarize how far the testing profession has come with automated scoring and comment on where I think research has the potential to take us.