This is the first in a series of posts highlighting key pieces of work from the Center’s first twenty years. Each post will feature a document, set of tools, or body of work in areas such as large-scale assessment, accountability systems, growth, educator evaluation, learning progressions, and assessment systems. In keeping with the Center’s 20th anniversary theme, Leveraging the Lessons of the Past, our goal is to apply the lessons learned from this past work to help us improve assessment and accountability practices for the future.
At the time of the Center’s founding in 1998, the reliability of state accountability systems was a major focus for co-founder Rich Hill. His interest was first piqued by a 1996 statement from Eva Baker and Bob Linn arguing that, in measuring the progress of schools, “fluctuations due to differences in the students themselves could swamp differences in instructional efforts.” As states began holding schools accountable for the percentage of students Proficient in a given year (status) and the change in that percentage across one or two years (improvement), Rich believed it was important that states understand the sources of error in estimates of school status and improvement, the impact of those errors on year-to-year fluctuations in school performance, and consequently, their impact on whether schools meet state accountability targets. In particular, he believed it was important for states to understand “which factors make a big difference and which ones don’t – and to have some idea of what the reliability of a particular design will be – so that consequences of the accountability system can be proportional to the reliability of the system.”
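Baker and Linn’s point is easy to demonstrate with a quick simulation. The sketch below uses entirely hypothetical numbers – a 60-student tested cohort, a student-level score SD of 50, and a real 5-point instructional gain – to show how cohort-to-cohort sampling noise alone can dwarf a genuine improvement:

```python
import random
import statistics

random.seed(1)

N_STUDENTS = 60      # tested students per cohort (assumed)
SCORE_SD = 50.0      # student-level score SD (assumed)
TRUE_GAIN = 5.0      # real instructional improvement, in points (assumed)
TRIALS = 10_000

def cohort_mean():
    """Mean score of one randomly drawn cohort of students."""
    return statistics.fmean(random.gauss(500.0, SCORE_SD) for _ in range(N_STUDENTS))

# Observed year-to-year change: new cohort (with the real gain) minus old cohort.
changes = [(cohort_mean() + TRUE_GAIN) - cohort_mean() for _ in range(TRIALS)]

noise_sd = statistics.stdev(changes)
declines = sum(c < 0 for c in changes) / TRIALS

print(f"true gain: {TRUE_GAIN:.1f} points")
print(f"SD of observed year-to-year change: {noise_sd:.1f} points")
print(f"share of trials showing a decline despite real improvement: {declines:.0%}")
```

Under these assumed numbers, the standard deviation of the observed year-to-year change is nearly twice the size of the real gain, so a substantial share of genuinely improving schools would still appear to decline in any given year.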
From 1998 to 2003, in a series of papers presented at conferences such as RILS, AERA, NCME, and the CCSSO Large-Scale Assessment Conference, Rich examined the reliability of the scores computed in typical state accountability systems. Combining Monte Carlo simulations with direct computations based on actual state assessment results, he addressed issues such as sampling error vs. measurement error, the reliability of status vs. improvement scores, and the impact of making multiple dichotomous comparisons. Over time, he turned to practical questions such as the pros and cons of increasing the number of students included in calculations (i.e., testing at more grade levels) and the impact on reliability of the distance between a school’s performance and its annual target. In these papers, he addressed the practical issue of school misclassification through questions such as:
- If every school in the state made no real improvement, what percentage of them would show observed gains that equal or exceed their goal, simply because of the random fluctuations that occur in schools’ observed scores from year to year?
- If every school in the state made real improvement exactly equal to its goal, what percentage would show observed gains that equal or exceed that goal? (spoiler alert: exactly half)
- If every school in the state made real improvement equal to twice its goal, what percentage would show observed gains that equal or exceed the goal?
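All three questions can be explored with a small Monte Carlo simulation in the spirit of Rich’s analyses. The specific numbers below – 100 tested students per school, a 50 percent baseline proficiency rate, and a 3-percentage-point gain target – are assumptions for illustration only:

```python
import random

random.seed(7)

N_STUDENTS = 100   # tested students per school (assumed)
BASE_RATE = 0.50   # true percent proficient in year 1 (assumed)
GOAL = 0.03        # required gain: 3 percentage points (assumed)
N_SCHOOLS = 10_000

def observed_pct(true_rate):
    """Observed percent proficient for one cohort, with binomial sampling noise."""
    return sum(random.random() < true_rate for _ in range(N_STUDENTS)) / N_STUDENTS

def share_meeting_goal(true_gain):
    """Fraction of schools whose *observed* gain equals or exceeds the goal."""
    hits = 0
    for _ in range(N_SCHOOLS):
        gain = observed_pct(BASE_RATE + true_gain) - observed_pct(BASE_RATE)
        hits += gain >= GOAL
    return hits / N_SCHOOLS

scenarios = {"no real improvement": 0.0,
             "improvement = goal": GOAL,
             "improvement = 2x goal": 2 * GOAL}
results = {label: share_meeting_goal(g) for label, g in scenarios.items()}
for label, share in results.items():
    print(f"{label:>22}: {share:.0%} appear to meet the goal")
```

Because observed gains are discrete and ties count as meeting the goal, the middle scenario lands slightly above one-half in this sketch; in the continuous approximation it is exactly half, as the spoiler promises.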
By the time I joined the Center in July 2002, the accountability requirements of No Child Left Behind were becoming clear: adequate yearly progress, annual measurable objectives, multiple comparisons and conjunctive decisions based on subgroup performance across two content areas, annual testing of students in grades 3 through 8, and the nominal goal of 100 percent of students performing at the Proficient level by 2014. Consequently, the focus of the analyses shifted to the reliability of No Child Left Behind designs. This work culminated in the 2003 Hill and DePascale article, “Reliability of No Child Left Behind Accountability Designs,” published in Educational Measurement: Issues and Practice (Volume 22, Issue 3, pp. 12–20).
A major focus of the 2003 article was No Child Left Behind’s additional requirement that states establish accountability systems that were both valid and reliable. Through simulations and direct computations, we argued, “If one follows the language of the law literally, there is no design that will meet both requirements.” In short, as a state attempts to improve reliability by increasing the minimum number of students required to include a subgroup in the accountability system, more of the very students Title I and state accountability systems were designed to serve are excluded from the system – a statistical Catch-22.
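The Catch-22 can be made concrete with a back-of-the-envelope sketch. Assuming a purely hypothetical distribution of subgroup sizes across 1,000 schools, raising the minimum-n threshold shrinks the sampling error of a subgroup’s percent-proficient estimate but excludes a growing share of the subgroup’s students from accountability decisions:

```python
import math
import random

random.seed(3)

# Hypothetical distribution of subgroup sizes across 1,000 schools (assumed:
# many schools enroll only a handful of students from a given subgroup).
sizes = [max(1, round(random.lognormvariate(2.5, 1.0))) for _ in range(1000)]
total_students = sum(sizes)

P = 0.5  # proficiency rate that maximizes the standard error

print(" n_min   SE of % proficient   students excluded")
for n_min in (5, 10, 20, 30, 40, 50):
    se = math.sqrt(P * (1 - P) / n_min)                             # sampling error shrinks...
    excluded = sum(s for s in sizes if s < n_min) / total_students  # ...but exclusion grows
    print(f"{n_min:>6}   {se:>18.1%}   {excluded:>17.1%}")
```

The particular lognormal size distribution is invented for illustration; the direction of the tradeoff, though, holds for any distribution in which many schools enroll small numbers of subgroup students.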
Of course, knowing that a literal interpretation of the law will not yield a valid and reliable system frees the mind to explore more flexible interpretations. In the initial approvals of state accountability systems under No Child Left Behind, the U.S. Department of Education appeared to be receptive to such flexibility, but that flexibility was short-lived.
As states now begin to implement their new ESSA accountability systems, consider how to combine information across interim and summative assessments and across state and local assessment systems, and develop innovative, personalized systems of instruction, assessment, and accountability, the key issues remain the same:
- It is critical that states understand “which factors make a big difference and which ones don’t – and to have some idea of what the reliability of a particular design will be – so that consequences of the accountability system can be proportional to the reliability of the system.”
- Privilege validity in the design of accountability systems and the design/selection of the assessment systems that will feed them.
- Create a system that leads to reliable judgments, even if that requires some flexibility such as examining school improvement over longer periods of time.
- Understand, acknowledge, and carefully consider the real costs of Type I and Type II error and attempt to reach a reasonable balance between the two.