Acknowledging the Lack of Evidence that Remote Test Administration Can Measure Up to In-Person Testing
A recent article in Forbes by Jim Cowen painted an irresponsibly rosy picture of remote test administration. The two organizations that coordinated the interviews on which the article was based, The Collaborative for Student Success and EducationCounsel, have been unabashed in their advocacy for returning to state summative testing in spring 2021. I have no problem with people and organizations advocating for what they believe, but when such advocacy prompts others, in this case state education leaders, to take action, recommendations should be based on science rather than just beliefs. To make matters worse, the article was based only on interviews with executives from three major summative testing companies. It is irresponsible to suggest that remote testing is a settled, comparable substitute for in-person testing. It is not.
The Center for Assessment collectively serves on approximately 40 state and district assessment and accountability technical advisory committees (TACs). Personally, I serve on and/or coordinate eight TACs, all of which met this fall. We discussed remote administration and proctoring at each of these meetings, and had demonstrations by multiple assessment providers. TAC members (generally nationally renowned measurement experts) are a cautious bunch, but that caution comes from years of experience seeing how little it takes for things to go wrong with test score interpretations.
The experts from all eight TACs appreciated the considerable progress the testing companies have achieved in designing and implementing remote testing platforms. However, they were universally concerned that these companies have not yet produced evidence that scores from remote- and in-school-administered tests can be validly compared and aggregated. Just because a test can be delivered to a student remotely does not mean educators and leaders can treat the scores interchangeably with those from tests administered in traditional in-school settings. To be fair, the companies do not yet have the data and experience to produce such evidence, but I question whether this is the year to conduct such experiments. I discuss these concerns below.
Threats to Valid Interpretation of Test Scores From Remote Administration
Security issues are typically among the first concerns raised about remotely administered tests because the remote proctoring necessary to prevent cheating would require a level of intrusiveness that would not be acceptable, and may not be legal, in many states. Even if security concerns are overcome, there are many other threats to the interpretability of scores from remotely administered tests.
Differential access to the internet, devices, and suitable environments for instruction and assessment threaten the comparability of scores from remote and in-person tests. Even relatively stable internet access is no guarantee that testing will not be interrupted; all of us have experienced failed Zoom connections during important meetings. We must also consider how differences in test administration conditions will affect student motivation. For example, one student may be able to test in a quiet location with little disturbance (e.g., a private bedroom), whereas another student must complete the test in a location filled with distractions (e.g., at the kitchen table with multiple siblings) and both of these examples must be contrasted with typical administration conditions in a classroom full of peers and a teacher.
Critically, test accommodations requiring the participation of special educators, translators, or other specialized personnel may not be available at all or their administration may fall to an untrained adult. Beyond the obvious comparability threats when some students receive appropriate accommodations and others do not, there are potential legal issues under the Individuals with Disabilities Education Act (IDEA) and Title III of the Every Student Succeeds Act if students do not receive necessary accommodations.
There are many more threats to accurate interpretation than these examples. These challenges make it doubtful that test scores from remote and in-person administrations can be combined in ways that support valid interpretations. Being able to aggregate and compare results across schools, districts, and subgroups is the cornerstone of current school accountability systems, and it is also critical for accurate monitoring of score trends.
In the Forbes article, the testing company representatives make strong claims about how “states and districts will still be able to use the data from a remote administration to guide decision making,” and “they will still be able to disaggregate data to the student group level” even if they are not able to test all students.
Strong assertions require strong evidence. Thus far, I have seen no evidence to support such promises.
Comparing test scores is challenging in the best of conditions. Several of my colleagues and I contributed chapters to a recently released National Academy of Education volume, Comparability of Large-Scale Educational Assessments: Issues and Recommendations, which describes the many threats to test score comparability. Changes in the number and proportion of tested students, both overall and within subgroups, are among the most critical threats to the comparability of group scores. It is irresponsible to suggest that states disaggregate test scores by subgroup and use these disaggregated results to make decisions without carefully evaluating the comparability of the scores.
Wait, Why Are We Testing This Year?
The developing consensus is that typical test-based accountability should be paused this year because of too much uncertainty and uneven effects of the pandemic on various communities. If test-based accountability is off the table, what is the purpose of administering state tests?
Assessment providers and users need to be clear about the purposes of any assessment in order to evaluate the validity of the test score interpretations. Many have postulated that test score information can help policymakers understand the scope of the pandemic's effects on student learning so they can direct discretionary resources appropriately, a use that first depends on the assumption that states will have discretionary resources available. I might support this use case for tests administered under somewhat normal conditions (i.e., in school). Unfortunately, if tests are administered this spring, the conditions will be anything but normal.
At only one point in the Forbes article do we see a bit of caution:
“Given the vast impacts that this pandemic has had on our education systems, including so many factors being outside the influence of teachers and students, we would urge states to be thoughtful in how data and results are used this year.”
Unfortunately, this is too little, too late. Essentially the entire article pushes the utility of remote testing, except for this lone suggestion to be thoughtful. That is not helpful, and it puts states in a tough position. Their TAC members are urging caution with remote testing, yet state policymakers might be reading a prestigious publication like Forbes telling states, in effect, to not worry and be happy with remote testing. It is simply unfair to put state leaders in this position.