Test Score Meaning Under Remote Test Administration (Part 1)

Oct 20, 2020

Why Validity Is Threatened Under Remote Administration Conditions

This is the first of two posts on planning for the examination of the validity of scores collected through remote test administration. Part 2 is a discussion of a framework for the types of analyses that will be useful for understanding the degree to which scores are comparable between remotely tested students and students tested in the classroom, what might be done to adjust them if they are not, and the conditions under which data can be collected to support those analyses.

The comparability of test score meaning between pre- and post-pandemic test administrations is uncertain. Previously, we discussed how states can plan in Getting Ahead of the Curve: Planning for Accurate Equating in 2021, acknowledging that with current levels of uncertainty, there are no guarantees that acceptable equating quality will be achieved. 

Even those of us in educational assessment who deal regularly in the science of uncertainty are not comfortable with quite this much of it, and to be perfectly blunt, some anticipated testing conditions may very well prevent accurate equating in any traditional sense. Central among these conditions is remote test administration. 

States and local educational agencies have already had no option but to test students remotely for interim assessments, and the same risk remains for Spring summative assessments. This new reality presents extreme challenges for the comparability of scores between students and over time. Where scores are not comparable, the validity of some planned interpretations will be difficult, if not impossible, to support. It simply may not be appropriate to proceed in a “business as usual” way relative to how we interpret and use standardized test scores.

Foundationally, the comparability of score meaning over students and time depends on the stability of certain features of the system in which testing occurs—the test content, the conditions of measurement, and the students themselves. As we discussed in detail in Restart & Recovery: Assessments in Spring 2021, we are likely to experience changes in all three this year. 

How Remote Administration Might Be a Problem 

There are many reasons why we are concerned about the effect of remote testing on the comparability of score meaning, but we can group them into three general categories: access to the test, motivation to perform on the test, and opportunity to learn. Differences in any one may threaten the validity of the claims we wish to make based on student test scores. 

Access to the Test

  • Some students will not have consistent access to the types of internet connections required for smooth and uninterrupted test administrations.
  • Some students will not have access to devices for accessing test content in a way that meets test design and administration requirements.
  • Testing instructions may not be equally accessible without a human proctor present, particularly for students with approved accommodations and English language learners.
  • Some students may test remotely while others test at school.

Motivation to Perform on the Test

  • Students may have greater opportunity to cheat overall, and that opportunity may differ across remote proctoring scenarios.
  • Students may be differentially motivated to do their best due to changes in expectations about student learning and performance in remote and hybrid contexts.
  • Students may have differential access to a quiet home environment for testing without distractions or interruptions.

Opportunity to Learn Due to School Disruptions

  • Testing windows may not be the same for all students, creating differential periods of instruction and opportunities to learn before the test is administered.
  • The cumulative effect of school disruptions last year and this year on the amount and quality of instruction may not be the same for all students.

This last bullet point moves beyond the effects of remote testing to the broader effects of COVID disruptions, but it is important to include here because conditions of learning have a direct effect on the comparability of scores. The accuracy of study designs that evaluate score comparability will depend on information about changes in the conditions of learning.

The Consequences of Ignoring Test Administration Differences

Normally, we work hard to establish standardized testing conditions so that we can begin with the assumption that test scores are comparable and check for unusual circumstances where that assumption might not hold. If tests are administered remotely, we will need to assume that there is an effect on score meaning due to these issues, and likely to others that we have yet to experience or anticipate. We will need to treat such scores with skepticism and apply greater scrutiny to the validity claims that depend on score comparability—this scrutiny also applies to comparability over time where we wish to compare this year’s scores to past and future years, and it applies to the comparability of scores between students and student groups in 2021. The latter is arguably a more serious threat for student- and group-level interpretations. If a given test score no longer means the same thing for students tested in the classroom and students tested remotely (under many different conditions), the interpretations of those test scores and inferences about student performance will be invalid.

Ultimately, the consequences of using non-comparable scores as interchangeable are serious and they include inaccurate determinations of student status, progress, and relative performance across students and student groups—the very things we rely on for classroom-, school-, district-, state-, and federal-level decisions, including accountability.

How Will We Know if Remote Administration Is a Problem

To be sure, there is much speculation at this time about the anticipated threats to the validity of scores from remotely administered summative assessments in 2021. However, the only way to really know if score meaning is the same under remote administration conditions is to collect and study data for each condition. Doing so requires time and thoughtful planning, but fortunately, guidance is emerging that can support such efforts.

There are also some well-established designs and analyses that can be applied usefully to detect the effects of remote testing on examinee scores. 
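As one very simple illustration of what studying data for each condition can look like, the sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation) between scores from remotely tested and classroom-tested students. It is not a design recommended in this post: the function name and the scores are hypothetical, and the comparison is only interpretable as an administration effect if the two groups are otherwise equivalent (for example, through random assignment or matching), which is a strong assumption.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(remote_scores, in_person_scores):
    """Return Cohen's d: (remote mean - in-person mean) / pooled SD.

    Assumes the two groups are otherwise comparable; without that,
    a nonzero value cannot be attributed to administration mode.
    """
    n_r, n_p = len(remote_scores), len(in_person_scores)
    if n_r < 2 or n_p < 2:
        raise ValueError("each group needs at least two scores")
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_var = (
        (n_r - 1) * variance(remote_scores)
        + (n_p - 1) * variance(in_person_scores)
    ) / (n_r + n_p - 2)
    if pooled_var == 0:
        raise ValueError("scores have no variance; effect size is undefined")
    return (mean(remote_scores) - mean(in_person_scores)) / sqrt(pooled_var)


# Hypothetical scale scores, invented purely for illustration.
remote = [512, 498, 530, 541, 505, 520]
in_person = [508, 495, 522, 515, 500, 510]
d = standardized_mean_difference(remote, in_person)
print(f"standardized mean difference: {d:.2f}")
```

A value near zero would be consistent with comparable score distributions under that equivalence assumption, but it would not by itself establish that score meaning is the same across conditions.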

In our next post, we will discuss a framework for applying such designs and analyses and the conditions under which score adjustments might be considered. We will also consider how some typical study designs may be impacted by the absence of data from 2020 due to testing waivers. Such studies will be the only way to empirically answer questions about whether scores have the same meaning in 2021 as in previous years, and therefore whether they can be used in current accountability systems or for other interpretations based on trends over time.

Where score meaning is not the same and score adjustments are not possible or advised, modifications to how we interpret scores may be the only defensible option.