
Large-Scale Assessment and Learning to Love Uncertainty

Making Sense of Multiple Sets of Test Scores

In this post, I examine how we can make sense of student performance and improvement when presented with results from a variety of large-scale assessments, such as state summative tests, NAEP, and interim assessments. Despite the uncertainty that comes with multiple sets of assessment results, I argue that we should consider what we can learn from each to help advance student learning.

The Nobel Prize-winning and famously irreverent physicist Richard Feynman once said that "being a scientist means being in love with uncertainty." In my two careers, first as a field biologist and now as an educational measurement professional, I've embraced uncertainty. But I recognize that most people do not. When people see a test score, their natural inclination is to treat it as an exact representation of what a student knows. That view of test scores is like nails on a chalkboard to those of us who study measurement, because we understand that every observed score includes some error or uncertainty.
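To make that concrete, here is a minimal sketch, in Python, of the classical test theory view that an observed score is a true score plus error. The scale values (a score of 500, a standard deviation of 50, a reliability of 0.90) are hypothetical, not the parameters of any particular test.

```python
import math

def score_band(observed: float, sd: float, reliability: float,
               z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence band around an observed score.

    Classical test theory treats an observed score X as a true score T plus
    error E (X = T + E). The standard error of measurement (SEM) is commonly
    estimated as SD * sqrt(1 - reliability).
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem

# A seemingly precise score of 500 is really a range of plausible values.
low, high = score_band(500, sd=50, reliability=0.90)
print(f"{low:.0f} to {high:.0f}")  # -> 469 to 531
```

Even on a quite reliable test, the plausible range spans more than 60 scale-score points; that band, not the single number, is what a test score actually tells us.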

As we emerge from COVID-caused disruptions to student learning, we at the Center for Assessment, along with many of our colleagues across the country, are trying to make sense of various national, state, and district assessments to understand the degree to which educators and leaders are helping students make up lost ground. There's an old adage that made more sense before smartphones: "When you have one watch, you always know the time; when you have two, you're never quite sure." The same logic applies to assessments: uncertainty grows when the various assessments portray slightly or even substantially different "times."

Be Thoughtful When Comparing State and Interim Assessment Results

We generally trust state assessment results to provide a more accurate picture of student learning than interim assessments because, assuming we have access to enrollment and participation data, we have a much better understanding of who was tested, what they were tested on, and how they were tested (remotely or in person) on the state assessment than we do on interim assessments.

That said, interim assessment companies have been able to collect data throughout the pandemic (with the exception of spring 2020) so we can gain some insight from this longitudinal information. The interim and summative assessment watches have generally been in the same time zone, except when they are not! 

How should state and district leaders with the responsibility to allocate ESSER and other funds based on academic need deal with differences in results? Should they believe an optimistic picture of a "V-shaped" recovery, for example, or a more pessimistic (and perhaps more realistic) view of stabilization? They cannot cherry-pick.

Rather, these leaders need to interrogate the results to try to understand why each might be telling a somewhat different story. The first step, as Damian Betebenner, Leslie Keng, and I recently explained, is to examine the characteristics of the sample of students who took each test in 2022 and in the comparison years. Additionally, state and district leaders need to look critically at the knowledge and skills each assessment program tests. How much overlap is there with the knowledge and skills you expect your students to learn? Are there differences in the depth at which each test measures these content domains, and how well does this match how you expect your teachers to teach?
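As one hedged illustration of that first step, the sketch below compares the demographic composition of the tested samples across years. The file names and columns (econ_disadvantaged, ell, swd) are hypothetical placeholders for whatever enrollment and participation data a state or district actually has.

```python
import pandas as pd

# Hypothetical roster extracts: one row per tested student, with 0/1
# demographic flags. Real file and column names will vary by state or vendor.
tested_2019 = pd.read_csv("tested_students_2019.csv")
tested_2022 = pd.read_csv("tested_students_2022.csv")

GROUPS = ["econ_disadvantaged", "ell", "swd"]

def composition(df: pd.DataFrame) -> pd.Series:
    """Percent of the tested sample belonging to each demographic group."""
    return df[GROUPS].mean() * 100

# If the tested sample shifted sharply between years, some of the apparent
# score change may reflect who was tested rather than what was learned.
summary = pd.concat(
    {"2019 (%)": composition(tested_2019), "2022 (%)": composition(tested_2022)},
    axis=1,
)
summary["shift"] = summary["2022 (%)"] - summary["2019 (%)"]
print(summary.round(1))
```

A large shift in any row is not proof that the scores are misleading, but it is a signal to dig deeper before comparing results across years.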

So don't get rid of any watches (unless you know they are broken), and stop searching for a single "right answer." Instead, look carefully at what each watch can tell you.

NAEP is Here!

The National Assessment of Educational Progress (NAEP) 2022 Long-term Trend (LTT) results, released on September 1st, portrayed a stark picture of the performance of 9-year-old students – a dramatic decline since 9-year-olds were last tested in the winter of 2020. The LTT has consistently documented student achievement, at the national level only, since the early 1970s. 

The "Main" or State NAEP will be released on October 24th. The State NAEP, as the name implies, has been producing state-level results since the early 1990s. Main NAEP results will also be released for the Trial Urban District Assessment (TUDA), which covers the 26 large urban school districts that participate in the assessment program. Together, the state and district results will offer a comprehensive view of the effects of the pandemic across the country.

The Main NAEP results will differ from the LTT for numerous reasons, mainly because the two tests are designed to measure different aspects of math and reading. The LTT focuses on relatively "basic skills," while Main NAEP reflects the more current and complex learning targets today's students are expected to master.

The Main NAEP was last administered in 2019, a year earlier than the LTT's winter 2020 administration, so we have an additional year to account for when comparing changes in performance over time on the two tests. Most people, however, will not be concerned with comparing the Main and LTT NAEP scores. Rather, they will compare Main NAEP to state assessment and perhaps interim assessment results.

Even though I am a member of the National Assessment Governing Board, I am not obliged to say NAEP is the truth. However, NAEP is able to avoid many of the challenges faced by state testing programs. The National Center for Education Statistics (NCES), the agency responsible for operating NAEP, has the time and resources to ensure the 2022 and 2019 samples are as equivalent as possible. Additionally, NCES is able to ensure that the 2022 scores are validly linked to the same underlying score scale as the 2019 test (or the 2020 test for the LTT).
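To give a feel for what "linking to the same scale" means: operational NAEP linking rests on item response theory and carefully managed item pools, so the sketch below is only a simplified stand-in. It shows mean/sigma (linear) linking, which places new-form scores on an old form's reporting scale; every number and variable here is invented for illustration and is not NCES's procedure.

```python
import numpy as np

def linear_link(new_scores: np.ndarray, new_anchor: np.ndarray,
                old_anchor: np.ndarray) -> np.ndarray:
    """Mean/sigma linking: map new-form scores onto the old reporting scale.

    Matches the mean and standard deviation of the new anchor results to
    those of the old anchor results, then applies that linear transformation
    to all new-form scores. A crude stand-in for operational IRT linking.
    """
    slope = old_anchor.std(ddof=1) / new_anchor.std(ddof=1)
    intercept = old_anchor.mean() - slope * new_anchor.mean()
    return slope * new_scores + intercept

# Entirely synthetic anchor data, for illustration only.
rng = np.random.default_rng(0)
old_anchor = rng.normal(250, 35, size=2000)  # prior year, reporting-scale metric
new_anchor = rng.normal(148, 20, size=2000)  # same anchor items, new raw metric
print(linear_link(np.array([148.0, 168.0]), new_anchor, old_anchor).round(0))
# A raw 148 lands near the old mean of 250; a raw 168 lands near 285.
```

The point is not the arithmetic but the design requirement: without a defensible link, a change in scores could simply reflect a change in the test.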

Essentially all states have released their 2022 state assessment results by now. With the public release of the Main NAEP results on October 24th, states will be faced with the two-watch problem. In some (perhaps many) states, the direction and magnitude of changes in the state and NAEP scores will be fairly coherent. Almost certainly, though, there will be many cases where the two sets of assessments tell different stories.

Gauging the State of Student Achievement as We Emerge from the Pandemic

State and district leaders should honestly examine the results of the various assessments to understand how the pandemic affected student achievement and the rate at which students are, we hope, starting to recover. While some state results suggest that student achievement is almost back to 2019 levels (and I question these results), most indicate that students and schools still need massive recovery efforts. We may have to wrestle with some uncertainty about exact levels of performance, but we know enough to act.

Main NAEP's most significant advantage is that it provides the first comparable picture of how the pandemic affected student achievement across all states. These results offer a tremendous opportunity to understand how the pandemic's disruptions played out in different parts of the country. Unfortunately, in this environment, some will use the state-level NAEP results to make political points about different approaches to education during COVID.

Avoid “Just So” Stories to Explain Large-Scale Assessment Results

We are great at creating stories to explain test scores AFTER we see the results, just as Kipling's narrator explained unique natural features like the camel's hump. These stories rarely match the hypotheses people offer BEFORE they see the results. Daniel Kahneman and Amos Tversky, the founders of behavioral economics, demonstrated this convincingly (see Michael Lewis' The Undoing Project for a terrific summary of their groundbreaking collaboration).

Therefore, I urge politicians, pundits, and anyone else tempted to use the forthcoming NAEP release as a political cudgel to commit now, before seeing the results, to a prediction of the score patterns and a rationale for those predictions. But even the most well-thought-out hypothesis is still only a hypothesis, and everyone must be open to refining theirs as more data become available. When the results are released, we should evaluate the predictions with humility and resist the urge to score cheap political "just so" points. Our collective focus needs to be on doing all we can to support accelerated student learning.
