Focus, Fix, Fit: Understanding the Meaning of 2021 Test Scores

Feb 17, 2021

Finding a Path Forward Following 2021 Test Scores Using Existing Tools and Procedures

To answer the question of what 2021 test scores will mean, I start by acknowledging that the interpretation of assessment results is always a process of reasoning from evidence, with some level of uncertainty. As with all things COVID, we expect to have more uncertainty this year, but we still have the tools to examine just how uncertain we are about the meaning of test scores in 2021. I emphasize this point because planning to engage known measurement tools can get us out of the trap of trying to decide now how to use scores that cannot be empirically evaluated until the data are collected. 

The first step in analysis planning in 2021 is actually very simple: decide clearly which direction of comparison for 2021 test scores will be prioritized. Once this decision is made, the path to selecting and engaging the right analytic tools can be laid out efficiently. An analysis plan can be oriented to compare back over time, to look forward from 2021, or to interpret 2021 test scores in isolation.

Looking Backward to Identify Changes in Achievement During the Pandemic

If the goal is to compare achievement before and after COVID, we are by definition looking backward to characterize, or quantify in some way, the real change in achievement that occurred during the pandemic. In this scenario, the importance of knowing if the test scale has drifted in any way becomes immediately clear, and we have many tools in our kit to check for scale drift. In fact, such analyses are routinely incorporated into evidence collection designs to support test score validity claims. 
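
As one illustration of a drift check, the sketch below compares anchor-item difficulties from a free 2021 calibration against their historical values and flags items that moved more than a chosen amount. It is a minimal example under stated assumptions, not a description of any particular program's procedure: the item names, the Rasch-style logit difficulties, the function name, and the 0.3-logit threshold are all hypothetical, and operational checks usually iterate, removing flagged items and recomputing the shift.

```python
from statistics import mean


def flag_drifting_items(historical, current, threshold=0.3):
    """Flag anchor items whose difficulty moved more than `threshold` logits.

    `historical` and `current` map item IDs to difficulty estimates.
    The threshold is an illustrative assumption, not a standard.
    """
    common = sorted(set(historical) & set(current))
    # Remove the average shift between the two calibrations so that
    # only item-specific (relative) movement is evaluated.
    shift = mean(current[item] - historical[item] for item in common)
    flagged = {}
    for item in common:
        displacement = (current[item] - shift) - historical[item]
        if abs(displacement) > threshold:
            flagged[item] = round(displacement, 3)
    return flagged


# Hypothetical anchor items: item "D" became noticeably harder in 2021.
historical = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 0.5}
current = {"A": -0.9, "B": 0.1, "C": 1.1, "D": 1.4}
print(flag_drifting_items(historical, current))
```

Items flagged this way would typically be reviewed before deciding whether they can still serve as anchors.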

The simplest way to make sure a test scale does not drift due to COVID effects is to “fix” test scales to their past properties. Taking this step essentially avoids the influence of COVID-related conditions on any newly-estimated properties that might distort the test scale.  Of course, fixing test scales does not by itself guarantee that we can simply proceed in a “business-as-usual” manner, but I will get to that shortly. 

Looking Forward and Viewing 2021 as a New Beginning

Comparing to the past is not the only orientation worth considering. In some cases, it may be more desirable to look forward from 2021. Many have suggested that the events of 2020 offer an opportunity to more permanently adjust teaching, learning, and assessment to fit a new educational environment as we emerge from the pandemic. If we view 2021 as a new beginning in this sense, we would consider updating or changing the test scale to fit the new educational priorities and conditions.

In this context, we would presume that the effects of the pandemic have made the existing test scales inadequate for future comparisons, and that it would not be useful to fix them to their historical properties or otherwise link to past score interpretations. We would then view the scaling problem as a need to either update the scale or create a new one, with new score definitions.

Looking at 2021 Student Performance in Isolation 

A third option is to treat 2021 test scores and achievement as isolated from any past or future score definitions. This perspective would be useful in cases where we wish to acknowledge upfront that what students learned in 2021 is substantively different from past patterns. From a scaling perspective, an isolated scale has the fewest constraints for implementation because linking is not required. Under certain conditions, we might even consider using simple raw scores (where all students take the same test) because we are no longer required to link score meaning to any other year. The challenge, of course, is that raw scores have limited meaning in terms of any future or past comparisons and are subject to considerable misinterpretation.

Placing Score Meaning in Context

A decision about the orientation of comparisons, of course, is not the only consideration for determining score meaning in 2021. It is merely the first one required in order to be clear about the right scaling path to take. We still have not fully answered the question of whether the conditions brought on by COVID have changed score meaning, but we have some great tools for that too. 

One approach is to collect opportunity-to-learn (OTL) data to help us understand impact and to tell the COVID story of educational achievement during the pandemic more generally (see: Aspen Institute & National Center for the Improvement of Educational Assessment, 2020; Dadey & Betebenner, 2020; Domaleski, Boyer, & Evans, 2020; Domaleski & Dadey, 2021; Marion, 2020). OTL data can be collected through external systems set up by the state, or through survey questions administered as part of the assessment that ask schools, teachers, students, or even parents about the educational environment this year.

Whether or not OTL data are collected, there are analyses that can be employed to understand the impact of changed conditions on score meaning in a more generalized way. We can simply ask: are students behaving differently from how we expect them to behave when they take their 2021 summative assessments?

The most obvious sign of changes to student behavior will be whether they show up to test at all. If we build tests and students don’t come, achievement summaries for those who do show up will not mean the same thing as when everyone takes the test. Also, if we are either looking forward or only comparing achievement across students, groups, schools, or districts within 2021, lower participation rates may influence scale development. In short, setting a scale on a non-representative sample of students is likely to produce an unusable scale, which is particularly problematic for setting a new forward-looking path based on data from 2021 test scores. 
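
The representativeness concern can be made concrete with a small calculation. The sketch below is only an illustration: the group labels, counts, and function name are hypothetical. For each student group it reports the participation rate and the gap between that group's share of tested students and its share of enrolled students.

```python
def participation_summary(enrolled, tested):
    """Summarize participation by group.

    `enrolled` and `tested` map group labels to student counts.
    Assumes every enrolled count and the total tested count are above zero.
    Returns, per group, the participation rate and the difference between
    the group's share of tested students and its share of enrolled students.
    """
    total_enrolled = sum(enrolled.values())
    total_tested = sum(tested.get(group, 0) for group in enrolled)
    summary = {}
    for group, n_enrolled in enrolled.items():
        n_tested = tested.get(group, 0)
        summary[group] = {
            "rate": round(n_tested / n_enrolled, 3),
            "share_gap": round(
                n_tested / total_tested - n_enrolled / total_enrolled, 3
            ),
        }
    return summary


# Hypothetical counts: group_b tests at a much lower rate than group_a.
enrolled = {"group_a": 1000, "group_b": 500}
tested = {"group_a": 900, "group_b": 250}
print(participation_summary(enrolled, tested))
```

A positive share gap means a group is over-represented among tested students relative to enrollment. Large gaps in either direction suggest that summaries, and any scale set on the tested sample, may not describe the full population.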

We also have some tools in our kit that take a less direct route to understanding student behavior by identifying unexpected patterns such as low motivation and possible administration irregularities (likely a higher risk with remote testing). Examples include standard irregularity detection routines, timing analysis, and examination of guessing behaviors. Where these behaviors occur at elevated levels, score meaning is altered. Examining differences in classical item analysis statistics from historical values will also be informative because it tells us whether students are responding to previously administered items in ways that differ from the past.
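
A comparison of classical item statistics against historical values might look something like the sketch below. It covers only one statistic, the item p-value (proportion correct), and the data, function names, and 0.10 flagging threshold are hypothetical choices made for illustration.

```python
def item_p_values(responses):
    """Proportion correct per item from rows of 0/1 scored responses."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[j] for row in responses) / n_students for j in range(n_items)
    ]


def flag_p_value_shifts(historical_p, current_p, threshold=0.10):
    """Return {item index: change} for items whose p-value moved more
    than `threshold` (an illustrative cutoff, not a standard)."""
    return {
        j: round(cur - hist, 3)
        for j, (hist, cur) in enumerate(zip(historical_p, current_p))
        if abs(cur - hist) > threshold
    }


# Hypothetical 2021 responses: 4 students by 3 previously administered items.
responses_2021 = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
historical_p = [0.80, 0.55, 0.50]
current_p = item_p_values(responses_2021)
print(flag_p_value_shifts(historical_p, current_p))
```

A shift in p-values alone does not separate real changes in achievement from changes in testing behavior, so flags like these are a prompt for further review rather than a conclusion.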

Although more rarely used, person-fit analyses are ideally suited to directly answer the question of whether students are behaving as expected. Person-fit information helps detect whether individual students and student groups (including students who may have taken their test remotely) are responding in ways consistent with what the test scaling model expects. The analysis does not explain why students are not responding in expected ways, but a large amount of misfit that departs from historical patterns would provide a clear marker for changes to score meaning in 2021.
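
To give a sense of what a person-fit analysis computes, the sketch below implements one widely used index, the standardized log-likelihood statistic (often written lz), under a Rasch model. The choice of index and model is an assumption made for illustration, since the discussion above does not name a specific statistic, and the item difficulties and response patterns are hypothetical. Large negative values indicate responses that are unlikely given the student's estimated ability.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def lz_person_fit(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic (lz).

    `responses` are 0/1 item scores, `theta` is the ability estimate,
    and `difficulties` are item difficulties on the same logit scale.
    Assumes at least one item has difficulty different from `theta`,
    otherwise the variance term is zero.
    """
    log_lik = expected = variance = 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical five-item test, ordered from easiest to hardest.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [1, 1, 1, 0, 0]  # easy items right, hard items wrong
aberrant_pattern = [0, 0, 1, 1, 1]  # easy items wrong, hard items right
print(lz_person_fit(expected_pattern, 0.0, difficulties))
print(lz_person_fit(aberrant_pattern, 0.0, difficulties))
```

Comparing the distribution of such an index in 2021 with its distribution in prior years, overall and by group or testing mode, is one way to look for the shift in misfit described above.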

Uncertainty Does Not Require that We Stop Moving Forward

To be sure, we have seen profound changes in teaching and learning. We do not know which schools will offer in-person instruction this spring, and which will not. We do not know the extent of differential impact the pandemic may have on student learning as a result of new and changing learning modes, physical and emotional health concerns, food, housing, and financial insecurity, and any number of other factors that have shifted under these extraordinary circumstances. In a nutshell, uncertainty is the prevailing sentiment.

A central challenge in planning to evaluate score meaning in 2021, then, is that it is occurring in an uncertain world, forcing us to plan for how we will use scores before we are even able to collect evidence about whether our test score scales will survive COVID. To avoid getting trapped in a circular conversation on score meaning, continually alternating between discussions of how scores should be used in 2021 and how they can be used, we need to take a path that ends with an answer and not a question. That path starts with deciding what the most important score comparisons are, and then using the knowledge and tools we have to help us understand if we can make those comparisons with acceptable levels of certainty.