Please Make One Small Change in Federal Testing Law to Yield Big Improvements
State standardized tests are criticized for a number of key reasons, including the dearth of historically marginalized individuals and groups in the testing process, the lack of transparency about what’s on the test, the lack of rich and complex item types, and the oversized footprint of state tests. Importantly, conflating the tests with onerous accountability policies is another factor that contributes to negative attitudes about state tests. These issues can and should be addressed within the Every Student Succeeds Act (ESSA). However, one issue, the way the law limits state summative tests’ ability to serve the purposes for which they are designed, will require a small but important change in the current federal testing law. We contend that a minor legal revision in the next reauthorization of the Elementary and Secondary Education Act (ESEA) will lead to substantial improvements in state tests and in the perception of state testing.
A regular and legitimate complaint from teachers and principals is that state test results do not support teaching and learning. The results are often not returned in time, and even if student scores are produced instantly, the tests are at the wrong grain size (i.e., not tied to actionable learning targets), not tied to specific curriculum and instructional models, and generally not reported in ways that can guide instructional decisions.
Much to the dismay of many users, state summative tests are designed and validated for a very limited number of intended purposes, primarily to support monitoring educational trends and to serve as inputs for school accountability calculations. For many years, testing companies extolled the instructional benefits of state tests, which further frustrated local educators when these instructional claims were not supported.
We can’t fully blame testing companies or state leaders for these mixed messages about the appropriate uses of state test scores. Why? Because it is the law!
Federal Testing Law Sets an Impossible Standard
ESSA, as did previous instantiations of ESEA, mandates the impossible task of having state assessments provide “diagnostic” information for individual students. Clauses (x) and (xii) in Section 1111 (b)(2)(B) impose challenging requirements for state tests:
“(x) produce individual student interpretive, descriptive, and diagnostic reports, consistent with clause (iii), regarding achievement on such assessments…
(xii) enable itemized score analyses to be produced and reported…to local educational agencies and schools, so that parents, teachers, principals, other school leaders, and administrators can interpret and address the specific academic needs of students as indicated by the students’ achievement on assessment items…”
These requirements reify the misconception that accountability tests can serve instructional purposes.
How Do States Meet This State Testing Requirement?
States address the Section 1111 requirements through the use of subscores for a limited number of sub-domains such as the number system, geometry, and algebraic reasoning in mathematics, and the comprehension of informational vs. literary text in English Language Arts. The U.S. Department of Education has signed off on this approach for more than 20 years. The reporting of subscores for these sub-domains, especially at the grain size for which they are operationalized, does not fully meet a strict reading of the law (i.e., provision of “diagnostic reports” to “interpret and address specific academic needs”), but it appears to be a “wink and a nod” sort of acceptance.
Even though users would like to get distinct information about a student’s performance in algebraic reasoning compared to their understanding of the number system, state tests are built and analyzed in such a way that the items best equipped to make such distinctions are actually less likely to be included on the test in the first place. Further, since the reliability of any test score is related to the number of items on the test, a subscore based on fewer items will almost always be less reliable than the total score. In short, subscores generally do not provide enough unique and fine-grained information to support individual student diagnoses, and the information they do provide is not that reliable.
Including subscores and associated reporting on state tests requires these tests to be longer than tests without subscores, because the test must include enough items in each of these subscore domains to meet minimal levels of reliability (generally 10 items or score points). If our goal were just to produce a total score, we could lightly sample from each of these subdomains to ensure that the content standards are fully represented, and we could do so with a considerably shorter, yet still reliable, overall test.
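The link between test length and reliability described above can be illustrated with the Spearman-Brown prophecy formula, a standard result from classical test theory that this article does not itself cite. The sketch below is only an approximation: it assumes that all items are interchangeable (parallel), which real subscore domains are not, and the 50-item test with a total-score reliability of 0.90 is a hypothetical example, not data from any state test.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after changing test length by `length_factor`.

    `reliability` is the reliability of the original test, and
    `length_factor` is new length / original length (0.2 means a
    subscore built from one fifth of the items). Assumes parallel items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical values chosen for illustration only.
TOTAL_ITEMS = 50
TOTAL_RELIABILITY = 0.90

if __name__ == "__main__":
    for subscore_items in (5, 10, 25, 50):
        factor = subscore_items / TOTAL_ITEMS
        predicted = spearman_brown(TOTAL_RELIABILITY, factor)
        print(f"{subscore_items:>2} items -> predicted reliability {predicted:.2f}")
```

Under these assumptions, a 10-item subscore drawn from the hypothetical test would have a predicted reliability of roughly 0.64, and a 5-item subscore roughly 0.47, well below the 0.90 of the total score. This is why reporting several subscores pushes the overall test to be longer.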
The Implications of Lifting the Requirement for Individual Diagnostic Information
A shorter test that did not include subscores would help users realize that state summative tests are not designed to support instruction for specific students, hopefully leading district and state leaders to use these tests for the purposes to which they are most suited: monitoring educational trends. Recent analyses examining the impact of COVID-based interruptions on student learning demonstrated the utility of state tests for such purposes.
Lifting the requirement for individual “diagnoses” could make space for other test designs that might be more effective for program and curriculum evaluation and inspire more ambitious teaching practices. For example, having a shorter test could free up space to allow such things as matrix-sampled performance tasks to support evaluations of deeper learning and signal the types of activities state leaders would like to see in classrooms.
States could fill the apparent gap from the elimination of subscores by supporting the development, selection, and use of resources such as modular interim assessments and formative tools that can more directly inform teachers and leaders about student performance in these specific domains. In other words, if we want to know how well students are doing in their understanding of the number system apart from their facility with algebraic reasoning, we should use tests specifically designed to measure these different attributes.
Be Clear About Test Purposes and Uses
Increasing assessment literacy is challenging enough without the federal government contributing to the confusion about the role of large-scale, state summative tests. We urge legislators and Congressional staffers to strike clauses (x) and (xii) from Section 1111 (b)(2)(B) to help clarify that these tests are designed for monitoring and accountability. State summative tests cannot also support the instructional needs of individual students.
Derek Briggs is a Professor of Education at the University of Colorado Boulder and Director of the Center for Assessment, Design, Research and Evaluation (CADRE). In addition to serving on many technical advisory committees and other expert panels, Derek is the immediate Past President of the National Council on Measurement in Education. Derek recently published an important new book: Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies.