Face It: Tests Have Consequences
Evaluating the Validity of Through-Year Assessments
Educational measurement professionals continue to debate something that seems wonky but is a big deal: whether to consider an assessment’s consequences when evaluating its validity. The role of consequences in test validity is important because it gets at the heart of how tests are used and whether their potentially negative or positive impacts should matter when we evaluate whether they produce valid inferences about student performance.
Validity evaluations require evidence about whether the test content matches what students are supposed to have learned and whether they engage in the types of thinking processes test designers intend. But a test’s consequences are just as important as those other dimensions of test quality. Test users need to know if a test they plan to use has a history of causing harm in addition to—or instead of—the good it is designed to support.
Even though legends in our field like Lee Cronbach, Sam Messick, Michael Kane, Lorrie Shepard, Bob Linn, and Ed Haertel have, since the early 1970s, emphasized the importance of consequences in evaluating test validity, some measurement professionals continue to resist. Measurement experts are still parsing phrasing in these earlier texts, like legal scholars debating amendments in the Bill of Rights.
Suzanne Lane and I don’t pull any punches in our validity chapter in the forthcoming 5th edition of Educational Measurement (often called the bible of testing). We are direct and clear in arguing that the consequences of testing are central to the evaluation of test validity. The current interest in through-year assessments illustrates why we must pay attention to test consequences when we evaluate the quality of testing programs.
Rising Interest in Through-Year Assessments Raises the Stakes
More than a dozen states are considering, designing, or implementing through-year assessment systems. Assessments in a through-year system are given in multiple, distinct administrations during a school year, resulting in a summative determination, such as a total score and/or proficiency determination, for students. Nearly all through-year systems are designed to address at least one additional goal, such as supporting instruction. (To learn more about the complexities and different permutations of through-year assessment, see our recently published paper on through-year assessment, as well as many other blog posts and resources on the Center for Assessment’s website.)
The current interest in through-year assessment systems is based on strong assumptions that such assessments can serve multiple purposes. These assumptions lead, in turn, to claims that the tests can support multiple inferences and uses, typically for both accountability and instructional utility.
But strong assumptions require strong evidence. Is there evidence to support the ambitious assumptions that are driving through-year assessment? A validity evaluation that includes consequential investigations is a crucial way to answer this question. (I work through an entire validity argument in my chapter, “Validity Arguments for Through-Year Assessments,” in the 4th edition of the International Encyclopedia of Education. Here I focus only on the consequences.)
Through-year designs are often proposed to address legitimate concerns with existing systems and to bring about positive changes (consequences), such as providing information that’s more instructionally useful to teachers. These designs are, in many ways, operating as interventions and should be evaluated as such.
In addition to documenting positive consequences, we must search for and evaluate potential unintended negative consequences as a core part of our validity evaluation. For example, an investigation might find that teachers are unable to interpret the score reports provided during the year and end up missing opportunities to help students, or worse, that the tests or reports simply don’t provide the information that teachers need to help students succeed.
Strong Assessment Claims Demand Strong Evidence
A well-developed theory of action for through-year assessments—or any assessment, for that matter—should describe how the system will support improved instruction. The theory of action might specify that teachers should be able to gain insights at a small enough grain size to target skills and knowledge students had not yet grasped, or identify what students need to learn next to maximize their progress.
The theory of action is often the first step in a validity evaluation that helps us unpack specific claims, such as whether:
- Through-year test components will yield results at a grain size sufficient to inform instructional decisions
- Teachers will be able to interpret and use the results from the through-year components to appropriately adjust instruction for individual students and for identifiable student groups
- Teachers will not engage in test preparation or other activities that distract from the instructional aims of the assessment system
These are just a few examples of claims associated with through-year systems. Importantly, we can empirically evaluate these and similar claims to understand the degree to which the evidence supports or refutes them. Of course, we can and should further articulate these claims to interrogate phrases such as “instructional decisions.” Do teachers’ instructional decisions involve simply putting students into homogeneous groups, or do they represent more ambitious instructional strategies, such as tailored formative feedback to each student, tied to the current curriculum-based learning goals?
These three example claims are obviously consequential. If the evidence supports these assertions, we could begin to conclude that the program is having positive impacts on students and teachers. But if the evidence refutes these claims, we will rightfully worry about the assessment’s unintended negative consequences.
The Urgency of Collecting Evidence on Through-Year Assessment
Like anyone designing a new statewide assessment, those advocating through-year systems are responsible for producing evidence to support their major assumptions and for evaluating that evidence fairly. I have not yet seen the evidence or logical arguments needed to support many of the assumptions and claims being made in through-year designs.
Perhaps that is because through-year assessment systems are too new to have generated the necessary evidence. But because mixing instructional and accountability uses in one system carries significant potential for negative consequences, it is urgent to begin constructing an interpretative argument and collecting evidence, especially evidence about both positive and negative consequences.
Am I picking on through-year assessments unfairly? I don’t think so. Many others and I have been consistent in our calls for consequential evidence to count in essentially all validity evaluations. The difference is that most developers of typical end-of-year accountability tests do not (or at least should not) make instructional claims, whereas through-year designs typically do.
Like the Pottery Barn saying, “If you break it, you buy it,” I’m saying, “If you claim it, you need to evaluate it.” If test developers make claims about instructional utility, they have a responsibility to evaluate these claims, including the positive and negative consequences tied to a variety of uses.
Note: I had the privilege of participating in a symposium on validity at the recent National Council on Measurement in Education conference in Chicago. Joining me were Suzanne Lane, emeritus professor at the University of Pittsburgh; Mike Russell, Boston College professor; and Daria Gerasimova, a University of Kansas researcher and former Center intern. This blog post expands on my remarks at the symposium.