The Burden of Proof: A Call for Validation Plans and Evidence in Educational Programs

Sep 13, 2018

Why Evaluation of Educational System Designs is Critical to Measuring Effectiveness and Results

Educational policy makers, program designers, and intervention developers typically identify a problem and propose a solution to that problem. Likely, they have a lot of experience and expertise that informs the design of the solution to that problem–but how do they know the assessment design achieved the intended outcomes? 

When it comes to educational assessment systems, we should be asking ourselves two key questions: 

  • What’s the intended purpose of the assessment?
  • What evidence serves to confirm that we’ve achieved our purpose?

Throughout the design of educational assessment and accountability systems under the Improving America’s Schools Act, No Child Left Behind, Race to the Top, ESEA Flexibility Waivers, and the Every Student Succeeds Act, states have proposed, developed, and implemented programs that should have resulted in more effective schools, improved student outcomes, and better teaching and learning. However, evaluation of these claims tends to happen after the fact, if at all. 

While evaluation may be a “very old practice” (Scriven, 1996, p. 395), there is some debate about its age as a formal discipline. Evaluation has been defined as judging the merit of something (Scriven, 1991). It has also been defined as a systematic process that determines the worth of a strategy in a specific context (Guskey, 2000). Both of these aging, but still relevant, definitions are rooted in judgment and evidence. Any judgment depends not only on evidence but also on the synthesis of that evidence to determine whether we’ve achieved the purpose of a system or strategy. 

Staff members at the National Center for the Improvement of Educational Assessment range in tenure from a little over a year to the full 20 years since the Center’s founding. Despite this range, each of us tends to ask the same question when evaluating a program or system of interest: how do we know it’s working? Most practitioners, implementers, and designers ask this question, too. Far too often, however, the question is asked too late. 

Reinforcing the Importance of Asking the Question Early and Often

The Center for Assessment’s 20th Annual Reidy Interactive Lecture Series (RILS) offers us an opportunity to take a look back and a look ahead. We have learned a lot about creating and supporting validation arguments over the past 20 years. The Center’s session at this year’s event on validation addresses the need to ask this question early in the design process and describes the validation process through three applications related to system coherence, assessment transition, and accountability system development. 

One of our major goals for RILS 2018 is to leverage the lessons of the past to forge an ambitious agenda for ways in which validity arguments and their supporting evaluation agendas can enhance equitable and excellent learning and life opportunities for all students.

Looking backward, practitioners and policy makers will quickly realize that strong validation agendas are difficult to design, let alone implement. However, we would argue that creating a validation agenda after the fact adds substantially greater complexity. We address this point in the following foci of our session: 

  • For assessment systems (this is where we have the most experience collecting evidence as a function of peer review, validity arguments, and technical reports)  
  • For accountability systems (where we are in the early stages of formal validity arguments and evidence)
  • For SEA programs across initiatives (where we likely have the least amount of formal evidence)

Looking forward, we believe several emerging shifts will require states to collect information systematically in support of validity agendas. These shifts include a movement from 

  • advocacy to evidence;
  • components to systems;
  • state-driven to locally-supported evaluation and validation activities; and
  • specialized, one-time studies to systemic validation and self-evaluation capacities.

These shifts will be challenged by new issues and old issues in new contexts, including evolving federal requirements, changing state requirements, and new state-articulated goals and priorities. 

The evolution of assessment technologies, attending to process data, measurement models, and the merging of diagnostic and summative assessment from multiple sources will increase the complexity of the validity questions we are asking and the evidence needed to answer those questions. This evolution will also require an emphasis on mixed-methods and combined formative-summative evaluation efforts that can focus on context-dependent needs and strategies. 

Overcoming Assessment and Accountability Challenges with a Focus on Evaluation

Given these substantial challenges, we propose frameworks that can help orient our assessment and accountability system designs with an eye toward validation efforts. By focusing on an evaluative perspective from the onset, we can increase the likelihood of capturing evidence that confirms (or disconfirms) that we have developed coherent educational programs, high-quality assessment systems, and accountability systems that incentivize behaviors and attitudes as intended. 

We recognize that these challenges are considerable and difficult to tackle, but the combined efforts and thinking of the Center and our partners can lead to strides in research and practice that can validate design decisions or support revisions in the future. 

We are producing a series of papers on the challenges and opportunities associated with validation efforts in system coherence, assessment continuity, and accountability system design. We have also organized several sets of concurrent sessions at RILS so that participants can engage with colleagues to better understand how we can overcome barriers in our work. 

We look forward to lively, engaging, and productive discussions when we gather together in Portsmouth, New Hampshire on September 27th and 28th.

For more information and to register for RILS 2018, please visit the 2018 Conference page on the Center website.


Guskey, T. R. (2000). Evaluating professional development. Thousand Oaks, CA: Corwin Press, Inc.

Scriven, M. (1991). Evaluation Thesaurus (4th ed.). Newbury Park, CA: Sage.

Scriven, M. (1996). The theory behind practical evaluation. Evaluation, 2(4), 393-404.